SUMMARY


During  the  past  five  years  a number  of  important 
developments  in  the  field  of  narrowband  digital  voice  communi- 
cations have  been  achieved  through  the  sponsorship  of  various 
government  and  Department  of  Defense  agencies.  To  implement 
the  coordination  and  evaluation  of  these  efforts,  a consortium 
of  representatives  of  the  Army,  Navy,  Air  Force,  Defense  Com- 
munications Agency,  National  Security  Agency,  and  Advanced 
Research  Projects  Agency  was  established  by  the  Assistant 
Secretary  of  Defense  (Telecommunications) . The  need  for  valid 
and  reliable  methods  of  predicting  user  acceptance  of  the  various 
narrowband systems was recognized at the outset by the Consortium.
It  was  acknowledged  that  a high  degree  of  intelligibility,  though 
necessary, is not a sufficient condition of user acceptance.
Other more subjective factors also contribute heavily to the
user's  acceptance  of  a communication  system.  Although  the  tech- 
nology of  intelligibility  measurement  was  already  highly  developed, 
no comparable technology existed for evaluating the subjective
aspects  of  the  user's  reaction  to  system  processed  speech.  The 
present  project  was  undertaken  to  meet  the  need  for  such  a tech- 
nology. It  resulted  in  the  development  and  standardization  of 
two valid, reliable and cost-effective methods of evaluating the
"quality" or overall acceptability of voice communication systems.

The Paired Acceptability Rating Method (PARM) was
developed to serve both as a research tool and as an interim
method to meet the immediate evaluation needs of the Consortium.

The results of research with PARM yielded valuable information
concerning the major sources of error in acceptability test results
and indicated the means to their control. In particular these
results showed that stable listener differences in subjective
origin constitute the major source of extraneous variance in
acceptability ratings and that control of this source can be
achieved  through  the  use  of  appropriately  selected  "probe  con- 
ditions." They  showed  further  that  listener  differences  can 
be  most  effectively  evaluated  by  means  of  standard  probe  con- 
ditions located  in  the  midrange  of  the  acceptability  continuum. 

Various results of research with PARM contributed to
the development of the Quality Acceptance Rating Test (QUART).
QUART  permits  evaluation  of  the  overall  acceptability  of  a com- 
munication system  and  also  yields  information  regarding  the 
perceptual  qualities  which  determine  the  degree  of  acceptance 
accorded  the  system. 

Research conducted with QUART has provided important,
if still tentative, insights concerning the nature and number of
elementary  perceptual  qualities  that  determine  the  user's  accep- 
tance of  a communication  system.  Subject  to  the  results  of 
additional  research,  QUART  can  yield  predictions  of  acceptability 
based not only on the listener's direct evaluation of acceptability,
but also on his evaluation of the degree to which a system
is  characterized  by  various  perceptual  qualities.  Such  predic- 
tions will  be  minimally  affected  by  the  personal  "taste"  or 
value  systems  of  individual  listeners  or  samples  of  listeners. 
QUART rating of systems with respect to various elementary perceptual
qualities can be expected to have substantial diagnostic
value.

Cross validation of PARM and QUART was accomplished
by correlating acceptability ratings of representative systems
by a sample of communication-involved military personnel with
PARM and QUART ratings of the same systems by a large sample of
professional  listeners. 
HISTORY OF THE PROBLEM


A number  of  significant  advances  have  taken  place 
in  the  methodology  of  speech  intelligibility  evaluation  during 
the  past  20  years.  These  are  represented  in  particular  by  the 
Fairbanks  Rhyme  Test  (Fairbanks,  1958),  the  Modified  Rhyme 
Test (House, et al, 1965), and the Diagnostic Rhyme Test (Voiers,
1971).  Such  tests,  to  the  extent  that  they  evaluate  the  useful 
information content of a transmitted speech signal, yield results
which  have  important  implications  for  the  overall  acceptability 
of  the  signal. 

Although intelligibility is unquestionably an important
factor in the overall acceptability of voice communication
systems,  highly  intelligible  speech  may  not  be  acceptable  in 
some  circumstances  of  human  communication.  For  example,  whispered 
speech  (synthetic  or  natural)  can  be  highly  intelligible,  but  is 
essentially  devoid  of  the  properties  normally  connoted  by  the 
term  "quality."  While  possibly  acceptable  in  special  circum- 
stances, whispered  speech  is  obviously  maladapted  to  many  others. 

A need  clearly  exists  for  practical,  scientifically 
valid  methods  of  evaluating  communications  equipment  and  de- 
vices in  terms  of  factors  other  than  speech  intelligibility. 

The  term  "quality"  is  commonly  used  in  reference  to  such  factors, 
variously  including  and  excluding  intelligibility  and  speaker 
recognizability. However, quality has yet to be defined in a
scientifically rigorous manner, which possibly accounts for the
fact that generally acceptable methods of evaluating speech
"quality" in an engineering context have also yet to be developed.

It  will  simplify  matters  to  define  the  issue  as  one 
of overall system acceptability, and to address the issue from
this point of view. Once the means of evaluating overall
acceptability  have  been  developed,  it  then  becomes  appropriate 
to  attempt  to  identify  the  perceptual  and  physical  acoustic 
correlates  of  acceptability.  Before  a valid  and  reliable 
measure  of  acceptability  can  be  developed,  however,  several 
issues  must  be  dealt  with.  Among  the  most  important  of  these 
is  the  issue  of  how  the  errors  inherent  in  all  psychophysical 
procedures  are  to  be  controlled.  It  is  appropriate,  therefore, 
that  the  various  types  of  error  and  the  means  of  controlling 
them be reviewed at the outset.

1.1 The Control of Measurement Error

1.1.1  Random  Sampling  Errors  - A diversity  of  random  effects 
are  potentially  operative  in  the  acceptability  evaluation  situa- 
tion. However,  four  major  sources  of  random  variation  most 
generally  account  for  the  bulk  of  the  practically  significant 
random variation in test results. Grossly, they can be identified
as interindividual listener differences, intraindividual
listener differences, interindividual speaker differences and
intraindividual speaker differences. Of these, intraindividual
speaker  differences  are  of  least  immediate  concern,  since  the  use 
of  recorded  speech  materials,  combined  with  systematic  selection 
of  these  materials  provides  rigorous  control  of  this  factor. 

The  others,  however,  merit  more  extensive  consideration. 

1.1.1.1 Sampling Errors Attributable to Interlistener Variation -
Listener  factors,  both  systematic  and  random,  are  potential  sources 
of  error  in  any  psychoacoustical  experiment  or  test.  Their  impact 
upon  test  results  is  likely  to  be  especially  significant  where  a 
listener's rating or judgment of a stimulus property is in some
degree  a matter  of  personal  taste  or  preference.  Other  things 
equal,  methods  of  acceptability  evaluation  which  solicit  a direct 
expression of the listener's acceptance or preference will tend
to  be  particularly  susceptible  to  random  sampling  error  asso- 
ciated with  listeners.  The  most  direct  means  of  reducing  this 
component  of  evaluation  error  is  by  increasing  the  size  of  the 
listener  sample,  but  there  are  other  means  of  reducing  listener 
sampling  error.  Individual  differences  in  response  tendency 
may  be  independently  evaluated  to  provide  a statistical  basis 
for  the  adjustment  of  data  yielded  by  "deviant"  subjects.  For 
example, a listener's ratings of a standard set of reference
conditions  can  be  used  to  determine  the  extent  of  his  tendency 
to  rate  more  leniently  or  stringently  than  the  typical  or  nor- 
mative subject.  His  responses  to  experimental  conditions  may 
then  be  adjusted  accordingly. 
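
The adjustment described above can be sketched as follows. This is a minimal illustration only, assuming a purely additive leniency offset estimated as the listener's mean deviation from normative ratings on the reference conditions. The function names, condition labels, and all numbers are hypothetical; the report does not specify the adjustment formula at this point.

```python
from statistics import mean


def listener_offset(listener_ref, normative_ref):
    """Mean amount by which a listener rates the reference conditions
    above (+) or below (-) the normative ratings (assumed additive)."""
    return mean(listener_ref[c] - normative_ref[c] for c in normative_ref)


def adjust_ratings(ratings, offset):
    """Remove the listener's estimated leniency offset from his ratings."""
    return {cond: r - offset for cond, r in ratings.items()}


# Hypothetical normative ratings of three reference conditions.
normative_ref = {"ref_low": 30.0, "ref_mid": 50.0, "ref_high": 70.0}
# Hypothetical ratings of the same conditions by one lenient listener.
lenient_ref = {"ref_low": 40.0, "ref_mid": 58.0, "ref_high": 82.0}

offset = listener_offset(lenient_ref, normative_ref)

# The same listener's ratings of two experimental conditions.
experimental = {"system_A": 65.0, "system_B": 45.0}
adjusted = adjust_ratings(experimental, offset)
print(offset, adjusted)
```

A stringent listener would yield a negative offset, so the same subtraction raises his ratings toward the norm.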

1.1.1.2 Sampling Error Attributable to Intralistener Variation -
Errors  of  significant  magnitude  may  arise  from  random  variation 
in  the  response  characteristics  of  a given  listener.  This  type 
of  variation  can  be  reduced  by  replication  in  accordance  with 
well-defined  statistical  principles.  As  in  the  case  of  inter- 
listener differences,  however,  seemingly  random  errors  may  have 
systematic origins. Depending upon the nature of the listener's
task, factors such as fatigue, habituation, and learning may
contribute to intralistener variation in an acceptability rating
situation. Generally, however, such effects are amenable to
experimental  control  through  careful  experimental  design. 


1.1.1.3 Sampling Error Attributable to
diction of system acceptability. Unfortun
vocoders in particular, are quite sensitive to speaker differences
in pitch (Voiers and Smith, 1972), and to other yet-to-be-identified
speaker characteristics (Voiers, et al, 1973) insofar as
speech  intelligibility  is  concerned.  But  it  remains  to  be  deter- 
mined that  the  individual  speech  characteristics  on  which  other 
aspects  of  acceptability  depend  are  subject  to  the  interaction  of 
speaker  and  system  characteristics. 

1.1.1.4 Sampling Error Attributable to Intraspeaker Variation -
It has been observed by many investigators that the intelligibility
of an individual's speech varies with a number of factors, for
example with level of vocal effort (Williams, et al, 1966). Inasmuch
as intelligibility is an important condition of overall
acceptability,  it  is  to  be  expected  that  system  acceptability 
measurements  will  be  subject  to  some  degree  of  variation  with 
intraindividual  speech  variation.  Ultimately  some  consideration 
should  be  given  to  this  issue  in  determining  the  suitability  of 
a system in the operational situation, though resolution of this
issue  is  beyond  the  scope  of  the  present  project.  While  the 
effects  of  intraindividual  speech  variation  are  not  systematically 
investigated here, they are rigorously controlled by the choice
of  speech  materials  used,  by  instructions  to  the  speakers,  and, 
more  generally,  by  the  circumstances  of  the  recording  situation. 

1.1.2 Adaptation Level Variation and Systematic Error -
Helson (1959) has shown that much of the extraneous variation
observed in the results of psychophysical experiments is ultimately
attributable to variation in the individual's adaptation
level (AL) for simple or complex stimulus properties.* His
judgment of the brightness of a light, the heaviness of a lifted
weight or the loudness of a sound is directly dependent on his
adaptation level or subjective origin for each of the stimulus
properties involved. Thus, individual differences in the response
to a given stimulus event can in many cases be explained on the
basis of individual differences in adaptation level for the
relevant stimulus property or properties.

* "Adaptation Level" is used in a relatively loose sense throughout
this report. Certain systematic shifts can occur in the
range of a listener's responses as a result of factors other
than true adaptation level changes. In the case of ratings of
system acceptability, such differences may result from different
conceptions of the communication situation, which factor may
account for observed systematic differences between ratings by
professional listeners and by system users who are more familiar
with the circumstances under which a system under evaluation
might be actually used.

In  summary,  adaptation  level  phenomena  have  important 
implications  for  the  precision  of  methods  for  evaluating  speech 
acceptability,  particularly  where  absolute,  as  well  as  relative, 
measurements  of  acceptability  are  involved.  On  one  hand,  residual 
AL  shifts  may  contribute  to  interlistener  variation.  On  the 
other  hand,  transient  or  intra-experimental  shifts  may  increase 
intralistener  response  variation. 

1.2 State of the Art in Acceptability Evaluation

Other  investigators  who  have  dealt  with  the  problem 
of speech acceptability or "quality" evaluation have been
sensitive  to  the  error  phenomena  discussed  in  the  previous 
section,  and  the  solutions  they  have  offered  generally  reflect 
special  concern  with  one  or  several  of  these  types  of  error. 

The  isopreference  method  of  Munson  and  Karlin  (1962) 
represents  a major  contribution  to  the  study  of  acceptability 
evaluation.  In  this  method,  both  a variable  test  parameter 
(loudness)  and  a variable  reference  signal  (high  fidelity  speech 
and  additive  random  noise)  are  used  in  a forced  pair  comparison 
task.  The  method  yields  a set  of  isopreference  contours  enclos- 
ing an  area  which  represents  the  optimum  setting  of  the  test 
system  with  respect  to  loudness  and  noise  level.  From  the  set 
of  isopreference  contours,  a "transmission  preference  level" 
is determined for the test signal, that level being simply the
signal-to-noise  ratio  (S/N)  of  the  reference  signal  that  is 
isopreferent  to  the  test  signal. 

Among  the  desirable  features  of  the  isopreference 
method  are  high  reliability,  unidimensionality  of  results,  and 
the  use  of  a physical  reference  scale.  The  method  provides 
extremely  rigorous  control  of  adaptation  level.  It  is,  however, 
somewhat maladapted for use in circumstances which involve other
than  the  simplest  types  of  signal  degradation.  The  use  of 
additive  random  noise  as  the  method  of  signal  degradation  may 
serve  among  other  things  to  invite  judgments  of  S/N  ratio  rather 
than of overall acceptability.

Rothauser, et al, (1967) developed a modification of
the  isopreference  method  in  which  only  the  reference  signal  is 
varied.  This  modification  is  substantially  simpler  to  implement 
than  the  original  method.  It  involves  a preliminary  test  to 
determine both the optimum loudness for test signal presentation
and the range of S/N ratios for the reference signals and
uses the S/N ratio at the point of isopreference as its indicant
of  speech  acceptability.  An  assumption  underlying  the  Rothauser 
modification is that speech "preferability" varies as a monotonic
function of S/N. The use of a simple reference for preferability
measurements, i.e., noise-degraded speech, is desirable
in that the standard can be easily described and reproduced
by  other  laboratories.  But,  as  in  the  Munson-Karlin  method,  the 
danger  exists  that  subjects  will  tend  to  assume  that  their  judg- 
ments are  to  be  based  primarily  on  the  noisiness  of  the  system 
under  test  rather  than  on  the  totality  of  its  subjectively 
relevant characteristics. Individual differences in listener
preference characteristics remain a major obstacle to the generalization
of results, as the developers of this method acknowledge.
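
The idea of reading off the reference S/N at the point of isopreference can be sketched as follows. This is a minimal illustration, assuming that the proportion of comparisons in which the test signal is preferred falls monotonically as the reference S/N rises, and that linear interpolation between tested S/N values is adequate. The function name and the data are hypothetical, and the interpolation rule is an assumption rather than the published procedure.

```python
def isopreference_snr(snr_db, prop_test_preferred):
    """Return the reference S/N (dB) at which the test signal and the
    reference are preferred equally often (proportion 0.5), found by
    linear interpolation between adjacent tested S/N values.
    Returns None if the 0.5 point is never crossed."""
    points = list(zip(snr_db, prop_test_preferred))
    for (s0, p0), (s1, p1) in zip(points, points[1:]):
        crosses = (p0 - 0.5) * (p1 - 0.5) <= 0
        if crosses and p0 != p1:
            return s0 + (p0 - 0.5) * (s1 - s0) / (p0 - p1)
    return None


# Hypothetical results: reference S/N values in ascending order and the
# proportion of comparisons in which the test signal was preferred.
snr = [0.0, 10.0, 20.0, 30.0]
pref = [0.9, 0.7, 0.3, 0.1]

level = isopreference_snr(snr, pref)
print(level)
```

Under the monotonicity assumption stated in the text, a higher interpolated S/N indicates a more acceptable test signal.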
The relative preference method (Hecker and Williams,
1966)  uses  several  fundamentally  different  types  of  distorted 
speech  as  references,  specifically:  peak  clipped  and  band- 
passed  speech  with  reverberant  echo,  lowpassed  speech  combined 
with  lowpassed  white  noise,  bandpassed  speech,  and  high  fidelity 
speech.  In  a typical  test  run,  the  test  system  is  compared  with 
each  reference  condition,  and  the  reference  conditions  are  com- 
pared with  each  other.  From  the  comparisons  among  reference 
conditions,  a ten-point  preferability  scale  is  constructed.  Then, 
from  the  comparisons  involving  the  test  system  and  each  of  the 
reference conditions, a preferability rating (1 to 10) is determined
for the test system. It should be noted, however, that
the  coarseness  with  which  the  reference  systems  are  scaled  may 
be  detrimental  to  the  efficiency  and  precision  of  the  method. 

The evaluation of any one system becomes effectively a function
of the degree to which the test system is preferred to a single
reference  condition.  For  example,  a fairly  high  quality  system 
will  quite  possibly  be  preferred  to  the  lowest  three  reference 
conditions  in  all  comparisons  involving  them.  Likewise,  it  will 
always  be  judged  less  preferable  than  the  highest  reference 
condition  (high  fidelity  speech).  In  this  circumstance,  the 
preference  value  assigned  the  system  under  evaluation  may  depend 
primarily  on  the  frequency  with  which  it  is  judged  to  be  prefer- 
able to  the  fourth  reference  condition  alone,  which  condition 
involves  not  only  a particular  degree  but  a particular  type  of 
degradation.  Moreover,  the  confounding  of  degree  and  type  of 
degradation  in  the  reference  signals  invites  a diversity  of 
artifacts, the full implications of which have yet to be evaluated.
The relative preference method would in any case appear
to  make  extremely  inefficient  use  of  the  listener's  time  and  of 
the  data  he  yields. 
The unit variance method of Voiers, et al (1965)
incorporates  a number  of  novel  theoretical  and  practical  features, 
but  was  designed  primarily  to  cope  with  a limited  class  of  sys- 
tems (vocoders)  and  could  not,  without  some  modification,  be  used 
with other types of systems. It is, in any case, extremely cumbersome
to prepare, administer, and score. Moreover, it shares
with other "isometric" methods a susceptibility to sampling error
associated  with  listeners. 

A simplified  pair  comparison  method  described  by 
Coulter  (1974)  appears  to  provide  relatively  reliable  rankings 
of  systems.  Like  other  pair  comparison  methods,  however,  it  is 
maladapted  to  situations  involving  conditions  of  widely  disparate 
acceptability.  Like  the  unit  variance  method,  it  involves  an 
extremely  tedious  process  for  the  preparation  of  test  materials. 

Distinct  from  the  relative  or  preference  methods  are 
the  absolute  methods,  several  of  which  (Richards  and  Swaffield, 
1959; Rothauser, et al, 1971; Grether and Stroh, 1972) may be
discussed  as  a group,  since  they  share  a number  of  crucial  fea- 
tures. In  all  of  the  variations  of  this  method  the  subject  is 
directed  to  describe  his  impressions  of  the  acceptability  of  the 
speech  test  signal  in  terms  of  a set  of  ordered  categories. 

Typical category labels are "Unsatisfactory," "Poor," "Fair,"
"Good,"  and  "Excellent."  Some  variations  of  the  basic  method 
involve a continuous scale on which selected points are labeled;
others  provide  the  subject  with  examples  of  the  extreme  categories 
in  order  to  "anchor"  his  subjective  scale;  still  others  present 
the subject with either all, or a representative sample, of the
test signals in order to orient him to the relevant range of
qualities.
The absolute preference methods are often characterized
by low reliability, presumably due to interindividual
differences  in  preferred  characteristics,  subjective  scaling 
factors,  and  adaptation  level  or  subjective  origin.  Given 
adequate  control  of  these  variables,  however,  the  absolute 
methods  have  a number  of  theoretical  advantages  in  addition 
to  the  practical  advantages  of  simplicity  and  economy.  In 
particular,  they  yield  "absolute"  rather  than  relative  measures 
of  acceptability. 

An  investigation  by  McDermott  (1969)  contributed 
significantly to the methodology of speech acceptability evaluation.
In this investigation, preference data and similarity
judgments  were  obtained  from  relatively  large  samples  of  listeners 
for  a set  of  21  speech  transmission  conditions.  The  results 
demonstrated  the  feasibility  of  predicting  preferability  or 
acceptability  from  judgments  made  with  respect  to  other  sub- 
jective dimensions,  a number  of  which  were  involved  in  judgments 
of  similarity.  An  especially  significant  aspect  of  this  demon- 
stration was  the  finding  that  similarity  data,  unadjusted  for 
listener  idiosyncrasies,  could  be  used  to  predict  the  results 
of  preference  judgments  which  were  statistically  adjusted  for 
listener idiosyncrasies. This finding suggests the means of
circumventing  what  is  perhaps  the  most  formidable  obstacle  to 
the  development  of  valid,  practical  methods  of  acceptability 
evaluation: the elementary fact that listeners tend far more
to agree on what they hear than on how well they like what they
hear. More importantly, McDermott's results raise the possibility
that measurements of what individuals perceive to be the
distinguishing  features  of  processed  or  transmitted  speech  can 
serve  as  valid  bases  for  the  prediction  of  acceptability  by 
listeners,  independently  of  the  values  placed  on  these  features 
by  the  individual  listener. 


2.0 BASIC APPROACHES TO THE PROBLEM--PROPOSED SOLUTIONS


2.1 Basic Approaches

In  light  of  McDermott's  results,  it  appears  that 
the  problem  of  predicting  system  acceptability  can  be  solved 
in more than one way. Two basic approaches can be distinguished.

2.1.1  Isometric  Approach  to  Acceptability  Evaluation  - 
One  approach  to  acceptability  evaluation  is  the  "isometric" 
approach,  in  which  an  evaluative  or  affective  reaction  is 
directly  solicited  from  the  listener.  The  validity  of  this 
approach  rests  heavily  on  the  assumption  of  representative 
sampling--the  assumption  that  the  listener  sample  is  represen- 
tative, both  qualitatively  and  quantitatively  of  the  population 
of  interest  from  the  standpoint  of  personal  preferences  or  tastes. 
To the extent that a listener sample values the same perceived
system qualities, and to the same degree, as the typical member
of the population of interest, accurate prediction of the acceptance
reactions of that population can be achieved with the isometric
approach. To the extent that the value systems of the
two groups differ, predictions based on isometric data will
necessarily be less accurate.

2.1.2 Parametric Approach to Acceptability Evaluation -
A second  approach  is  the  "parametric"  approach  in  which  the 
experimental  listener's  perception,  rather  than  his  evaluation 
of  a system  or  condition  is  used  as  a basis  for  predicting  the 
acceptance  reactions  of  the  population  of  interest.  The  validity 
of  the  parametric  approach  rests  on  two  assumptions: 

1.  That whatever their various preferences with
    respect to the perceptual qualities of transmitted
    speech, the experimental listener sample
    and the population of interest have in common
    the capacity for discriminating these qualities.

2.  That correlation exists--at the normative, if
    not the individual, level--between the perceived
    characteristics of transmitted speech and degree
    of acceptance by the population of interest.

It  follows  from  these  assumptions  that  even  the 
listener  who  does  not  value  (or  negatively  values)  the  percep- 
tual qualities  most  valued  by  the  population  of  interest  can 
provide information concerning the degree to which an experimental
speech signal is characterized by those qualities. Such
information can, in turn, be used to predict the acceptance
reactions of the population of interest.

Prerequisites  of  the  development  of  a parametric 
method  of  acceptability  prediction  are  (1)  the  development  of 
means  of  measuring  the  relevant  perceptual  qualities  and 
(2)  the  determination  of  relations  between  these  qualities  and 
the  evaluative  or  affective  reactions  of  the  user  population. 
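
The second prerequisite can be illustrated with a minimal sketch, assuming a single perceptual quality and a straight-line relation fitted by ordinary least squares. The function name and all values are hypothetical; QUART itself involves several perceptual qualities, and the report does not prescribe this particular model.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical calibration data for four systems: the listener panel's
# mean rating on one perceptual quality, and the mean acceptability
# rating given to the same systems by the user population.
quality_rating = [20.0, 40.0, 60.0, 80.0]
user_acceptance = [25.0, 40.0, 55.0, 70.0]

intercept, slope = fit_line(quality_rating, user_acceptance)

# Predicted user acceptance for a new system rated 50 on the quality.
predicted = intercept + slope * 50.0
print(intercept, slope, predicted)
```

Note that the listeners supplying the quality ratings need not share the users' preferences; only the fitted relation carries the users' values.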

2.2 Proposed Solutions

To  meet  both  the  near-term  and  longer-term  needs  of 
the DCA Narrowband Voice Consortium, both the above approaches were
experimentally  investigated.  The  end  products  of  these  inves- 
tigations were  the  Paired  Acceptability  Rating  Method  (PARM)  and 
the  Quality  Acceptance  Rating  Test  (QUART). 
2.2.1 Paired Acceptability Rating Method (PARM) - PARM is
a state-of-the-art method which utilizes the isometric approach.
It was initially conceived to serve as an interim method in order
to meet an immediate practical need. As such, it presents a
number of the problems typical of isometric evaluation methods,
but it is designed to permit rigorous control and the evaluation
of the major types of error commonly encountered in psychophysical
experiments. The information it has yielded regarding the
relative magnitudes of the various types of systematic and random
error has resolved a number of issues regarding the optimal
design of acceptability tests from the standpoints of scientific
validity and cost effectiveness. The availability of such information
greatly facilitated the refinement of PARM and the development
of the Quality Acceptance Rating Test. PARM will undoubtedly
contribute to further refinements in the technology of acceptability
evaluation.

2.2.2 Quality Acceptance Rating Test (QUART) - QUART utilizes
a combination of the isometric and parametric approaches, but was
designed, subject to the results of further research and development,
to function entirely as a parametric method of predicting
user acceptance. It solicits an evaluative response from the
listener, but also requires him to characterize a system-condition
in terms of various perceptual qualities.

Both  methods  have  been  validated  against  a set  of 
criterion  data  yielded  by  a large  sample  of  operational  commu- 
nications personnel  drawn  from  the  Air  Force,  Navy,  and  Army. 

Details  of  these  validation  studies  are  described  in  subsequent 
chapters,  following  a description  of  the  criterion  data  and  the 
method  of  its  collection. 


3.0 VALIDATION OF ACCEPTABILITY EVALUATION METHODS


It  is  commonly  observed  that  the  acceptability  of 
processed  speech  depends  upon  the  experience,  orientation  and 
needs of the listener. Thus the reactions of the communications
engineer who is heavily involved in the development of a
speech  processing  or  transmission  technique  are  often  found  to 
be  quite  different  from  those  of  the  casual  listener  or  the 
potential  system  user.  It  is  extremely  important  to  insure 
that  the  results  yielded  by  any  acceptability  evaluation  method 
permit  valid  predictions  of  the  reactions  of  the  population  of 
individuals  who  will  use  a system  or  device  in  the  operational 
situation.  It  is  essential,  therefore,  that  the  correlation 
between  the  reactions  of  laboratory  listeners  and  potential 
system  users  be  known.  To  permit  the  determination  of  this 
correlation, a survey was undertaken in which a large sample of
potential  system  users  was  presented  speech  materials  as  pro- 
cessed by  various  state-of-the-art  narrowband  and  broadband 
voice  communication  systems.  Both  the  affective  and  perceptual 
reactions  of  the  "target  sample"  to  these  systems  were  solicited, 
using, among other things, the QUART Rating Form described in
Chapter  5. 
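
The correlation referred to above can be computed as in the following minimal sketch, which assumes the Pearson product-moment coefficient taken over per-system mean ratings. The function name and the ratings are hypothetical and do not reproduce any data from the survey.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical mean acceptability ratings of four systems by laboratory
# listeners and by a sample of potential system users.
lab_ratings = [30.0, 45.0, 60.0, 75.0]
user_ratings = [35.0, 40.0, 65.0, 70.0]

r = pearson_r(lab_ratings, user_ratings)
print(round(r, 3))
```

A coefficient near 1 would indicate that laboratory ratings order and space the systems much as the users do, even if the two groups differ in overall rating level.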

3.1 Collection of Validation Data

3.1.1 The Target Sample - A total of approximately 130
military  and  civil  service  personnel,  all  of  whom  were  potential 
users  of  military  communications  equipment  and  systems,  partici- 
pated in  the  survey.  From  the  total  somewhat  heterogeneous 
sample of available respondents, a relatively homogeneous subsample
of 90 respondents was segregated for purposes of validating
PARM and QUART. Only male military personnel, both officers and
enlisted  men,  were  included  in  the  final  sample.  All  had  survived 
various  informal  checks  for  understanding  of  the  task  and  for  self 
consistency  in  performing  the  task. 
3.1.2 Data Collection from the Target Sample - Following
a brief explanation of the purposes of the survey, and of the
nature of the task, Target Sample respondents were presented
the following materials, to which they responded as indicated.


Speech Materials                    Response

One-sentence sample of each         Yes or no response to the
of 26 laboratory and system         question: "Would transmission
conditions as spoken by each        of this quality be generally
of three male speakers.             acceptable for purposes of
                                    routine communications in the
                                    job you presently perform?"

Twelve-sentence sample of           Rating of each system-condition
each of 26 laboratory and           on 12 perceptual qualities plus
system conditions as spoken         rating of acceptability on a
by one male speaker (CH or          100 point scale.
LL).

One-sentence sample of each         Yes or no response to the
laboratory and system-condition,    question: "Would transmission
as above.                           of this quality be at least
                                    minimally tolerable for
                                    purposes of routine communications
                                    in the job you presently perform?"

Twelve-sentence sample of           Rating of each system on
each laboratory and                 perceptual qualities and
system-condition as above, but      acceptability as above.
spoken by alternate male
speaker (CH or LL).

One-sentence sample of each         Yes or no response to the
laboratory and system-condition,    question: "Would transmission
as above.                           of this quality suffice at
                                    least for purposes of emergency
                                    communications in the job you
                                    presently perform?"
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Data  obtained  by  the  foregoing  procedures  are  ulti- 
mately of  interest  from  several  points  of  view  and  are  dis- 
cussed more  fully,  elsewhere.  Most  immediately,  however,  they 
are  of  interest  for  purposes  of  validating  FARM  and  QUART  as 
used  with  "professional"  listeners.  In  this  connection  two 
classes  of  results  are  of  greatest  relevance.  These  are,  first, 
the  results  based  on  the  respondents'  binary  judgments  of  sys- 
tem acceptability  and,  secondly,  the  results  obtained  from  the 
respondents'  ratings  of  the  various  laboratory  and  system- 
conditions.  The  development  of  appropriate  criterion  measures 
from  these  results  is  the  primary  issue  to  which  this  section 
is  addressed. 

3.2 Selection of an Acceptability Criterion Measure

The ultimate concern of a using agency is to determine
the proportion of the user population for which a system equals
or exceeds some level of acceptability. On the face of it,
therefore, one potential criterion of system acceptability is
provided by F(A), the estimated proportion of the user population
for which a given communication system or condition is considered
generally acceptable for purposes of routine communication.
However, F(A) has several shortcomings which limit its usefulness
and validity in this application. Most obvious is that F(A)
provides no discrimination of relative acceptability among systems
which are found acceptable or unacceptable by the entire sample
of listeners or respondents involved in a given evaluation. It
permits no distinction between two or more systems of sufficient
but differing degrees of acceptability. More generally, F(A)
permits precise evaluation of relative acceptability only over a
relatively narrow range of the acceptability continuum and fails
to provide adequate discrimination at one or both extremes of
the continuum.

The major underlying reason for F(A)'s limitations
as an acceptability criterion is familiar to statisticians in
the behavioral and biological sciences, and becomes evident
when one examines the relevant statistical principle. Given
the assumption that individual acceptance thresholds with
respect to one or more underlying perceptual continua tend to
be normally distributed, F(A) then represents an estimate of:

    $$ P(A) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^{2}/2}\,dz , $$

where P(A) is the proportion of the user population for which
the system-condition is acceptable and x is the position of a
system-condition on an underlying psychological continuum.

It is to be expected that x can be closely approximated
by the average (or a linear transformation thereof) of a
sample of listener acceptability ratings, R(A). Figure 3.1
confirms this expectation: F(A) is seen to have the
expected sigmoidal relation to R(A), the average acceptability
rating. Specifically, F(A) is the median (for three male
speakers) percentage of Target Sample members who indicated
general acceptance of a system for routine voice communications,
and R(A) is the average acceptability rating (on a scale of
0-100) assigned the system by the same sample of respondents.
(Since most of the system-conditions were found minimally
acceptable for emergency use, data with respect to these criteria
are of limited value in the present application. No further use
was made of them for purposes of this investigation.) The
curve shown in Figure 3.2 was obtained from the regression of
T(A) on R(A), T(A) being the corresponding normal deviate (with
arbitrary mean of 50 and standard deviation of 21.48) for each
of the obtained values of F(A).

Fig. 3.2 Transformed "Percent Acceptance"
as a Function of Acceptability Rating

In view of the high correlation which R(A) exhibits
with T(A), and of its other desirable properties (high
reliability, sensitivity to system differences over the full range
of the acceptability continuum, adaptability to use with small
samples, and Gaussian distribution), R(A) is clearly the best
choice as a criterion of system acceptability to the target
sample. Accordingly, it is used as the primary basis for the
cross-validation of PARM and QUART.
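
The conversion of a proportion of acceptance, F(A), into the
transformed score T(A) can be sketched as follows. This is a minimal
illustration, assuming that the "normal deviate" is the inverse of the
standard normal distribution function; the function name t_score is
hypothetical, while the mean of 50 and the standard deviation of 21.48
are the arbitrary constants quoted in the text.

```python
from statistics import NormalDist

T_MEAN = 50.0   # arbitrary mean quoted in the text
T_SD = 21.48    # arbitrary standard deviation quoted in the text


def t_score(f_a: float) -> float:
    """Convert a proportion of acceptance F(A), 0 < F(A) < 1, to T(A)."""
    if not 0.0 < f_a < 1.0:
        # A proportion of exactly 0 or 1 has no finite normal deviate;
        # this is the loss of discrimination at the extremes of F(A).
        raise ValueError("F(A) must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(f_a)  # standard normal deviate (assumption)
    return T_MEAN + T_SD * z


if __name__ == "__main__":
    for f in (0.10, 0.50, 0.90):
        print(f, round(t_score(f), 1))
```

A system accepted by every respondent, or by none, cannot be given a
T(A) value at all, which illustrates why an average rating is the more
serviceable criterion.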


4.0 INVESTIGATION OF THE ABSOLUTE RATING APPROACH TO
    ACCEPTABILITY EVALUATION


Most methods of comparing voice communications
systems from the standpoint of speech quality or acceptability
have been derived in one way or another from the classical
"Method of Pair Comparisons" (Guilford, 1954). However,
practical considerations of time and economy have usually
precluded the use of procedures which take full advantage of
the potential power and sensitivity of this method. The
classical method requires a single judge or subject to make
many comparisons (i.e., 100 or more) of the members of each of
all possible pairs of stimuli or conditions under evaluation.
Alternatively, the method can be adapted for use with a great many
subjects (i.e., 100 or so), each of whom judges each pair of
conditions only once.

Although variations of the method have been developed
to cope with the case of multiple judgments by multiple judges,
these variations are somewhat cumbersome to use and yield
results that cannot easily be generalized to the population of
interest. In particular, these methods are poorly suited for
use in circumstances involving small crews of judges or subjects
and small numbers of judgments by each subject. No matter how
precisely the reactions of a small panel of judges are evaluated,
the size of the panel remains the major determinant of the
generality of the results.

In the major variants of the classical method, the
judge's task is simply to order the members of each pair of
conditions with respect to some physical or psychological
continuum such as frequency, loudness, brightness, or aesthetic
acceptability. The binary data generated by this procedure are
normally subjected to a transformation (e.g., "phi-gamma" or
arc sine) designed to place all of the systems under
consideration on an equal interval scale, the unit of which is based
on intra- or inter-subject "discriminal dispersion," or other
unit of psychological distance. Such transformations are
feasible, however, only when relatively large numbers of
judgments (say, greater than 100) are made by each judge for each
pair of conditions. Normally, such scales have arbitrary
origins and are thus not ratio-preserving.
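
As an illustration of the kind of transformation described above, the
following sketch converts a matrix of preference proportions into
interval-scale values by taking the normal deviate of each proportion
and averaging. This is the textbook Thurstone Case V solution, used
here as an assumed stand-in; the report does not specify the exact
procedure, and the function name and any input values are hypothetical.

```python
from statistics import NormalDist, mean


def case_v_scale(p):
    """Scale values from p[i][j] = proportion of judgments preferring i to j.

    Assumes p[j][i] == 1 - p[i][j]. The resulting scale has an arbitrary
    origin (the values sum to zero), so only differences are meaningful.
    """
    n = len(p)
    inv = NormalDist().inv_cdf
    scale = []
    for i in range(n):
        deviates = []
        for j in range(n):
            if i == j:
                deviates.append(0.0)  # self-comparison contributes zero
                continue
            if not 0.0 < p[i][j] < 1.0:
                # Unanimous preference has no finite normal deviate.
                raise ValueError("proportion of 0 or 1 cannot be scaled")
            deviates.append(inv(p[i][j]))
        scale.append(mean(deviates))
    return scale
```

A condition that is preferred on every trial produces a proportion of
1 and cannot be placed on the scale without some further convention,
which is one reason the method works best for relatively similar
conditions.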

Some simplification of the pair comparison method can
be achieved by the sacrifice of the equal interval property,
as, for example, where the figure of relative merit is simply
the percent of time that each system or condition is preferred.
With such figures of merit, only the ordinal properties of the
acceptability scale are preserved (i.e., scale values are not
linearly related to the underlying scale of acceptability). In
any case, the pair comparison method in all variations is
optimally suited for comparative evaluation of relatively similar
conditions. Somewhat arbitrary procedures must be resorted to
in scaling widely disparate conditions, particularly where one
condition is universally favored or rejected. The classical
method and its major variants are, as such, not optimally adapted
for the evaluation of systems or conditions from an absolute
standpoint.

Outside information is normally necessary to transform
relative values obtained from pair comparison data to
values on an absolute scale which has a psychologically
meaningful zero point. One means of effecting this transformation is
to employ some of the absolute rating procedures in which each
condition of interest is judged in isolation using two or more
ordered categories, e.g., like-dislike. Since data yielded by
absolute judgments or ratings can themselves be used to scale
stimuli, use of the pair comparison method for purposes of
routine evaluation of system acceptability would seem, at best,
to provide an uneconomical solution.

The absolute rating approach has several features to
recommend it for present purposes. Although often regarded as
intrinsically less reliable than various comparative methods, the
absolute methods can greatly simplify the scaling problem. There
is, moreover, the possibility that the seemingly poor reliability
of absolute ratings derives from potentially controllable factors,
in particular, interindividual differences and intraindividual
shifts in adaptation level (AL). This was a major consideration in
the design and development of PARM.

There is little question that AL phenomena are operative
in any speech rating situation and may give rise to
significant variation in listener performance. What remained
to be determined in the present case were the practical
implications of the various components of AL. A major part of the
research described in the following sections was addressed
directly or indirectly to this issue.

4.1 Development of the Paired Acceptability Rating
    Method (PARM)

PARM was designed to provide a practical, reliable,
and valid method for relative and absolute evaluation of the
acceptability of voice communications systems. It is an
absolute rating method, but it utilizes a format that permits
comparative evaluation of experimental systems or conditions.
Each system-condition to be evaluated is presented under
circumstances in which the listener has the opportunity, if so
directed, to compare it (in two temporal orderings) with every
other experimental condition involved, and with one or several
"anchors" or reference conditions. For the purposes of PARM,
however, listeners were not asked to make comparative ratings.
The temporal ordering of conditions was designed to provide
uniformity of context, as represented, in particular, by the
immediately preceding condition.

4.2 Experimental Evaluation of PARM

4.2.1 Materials, Method and Procedures - The test materials
comprising PARM consist of a master corpus of six-syllable,
phonemically controlled sentences (see Appendix A) from which a
sample, or subset, is drawn for purposes of a given test
administration. Although the number of experimental conditions and
the number of speakers may be varied at the experimenter's
discretion, a three-speaker module presented via each of four
experimental transmission conditions and two reference conditions,
or anchors, was employed for purposes of the present series of
investigations.

From the listener's standpoint, PARM involves two
successive utterances of each of 30 sentences by each speaker.
The listener's task is simply to rate each utterance from the
standpoint of transmission quality or acceptability, using a
scale from 0 to 100. A rating of 100 indicates perfectly
acceptable transmission quality; a rating of 0, totally
unacceptable quality; a rating of 50, "half good enough"; and so on.

The manner in which the test speech materials are
presented to the listener is schematized below:

    First Utterance     Second Utterance

          1H                  1L
          2B                  2A
          3D                  3C
          4B                  4H
          ...                 ...
          27H                 27B
          28C                 28D
          29A                 29B
          30L                 30H

where the numbers from 1 to 30 identify the sentence uttered
and the letters identify the anchors and individual
system-conditions being evaluated. Specifically, the letter H
identifies the high anchor; L, the low anchor. The letters
A-D identify the systems or conditions being evaluated. Where
more than one speaker is used, the test speech materials for
each speaker are divided into two halves and presented in a
counter-balanced fashion, i.e.,


    S_c1     S_c2


where the letter subscript identifies the speaker and the numerical
subscript identifies the subset of test sentences spoken by that
speaker.

4.2.2 Test Design and the Control of Adaptation Level -
From the above discussion of adaptation level theory, it should
be evident that the reliability of absolute ratings depends
heavily on the effectiveness with which adaptation levels of
individual listeners are controlled over the course of a single
test as well as from one test to the next. It is clearly
desirable that individual differences in residual AL be effectively
minimized, whether by experimental or statistical means. Two
aspects of the design of PARM are directly addressed to this
problem. First is the manner in which speech samples for the
various system-conditions under test are temporally ordered.
Each system-condition is presented in the context of (i.e.,
following) every other system-condition under test. Context
is thus very nearly uniform across the system-conditions being
evaluated in a given PARM.
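
The ordering principle can be illustrated with a short sketch.
Assuming the six conditions of the present design (anchors H and L
plus systems A-D), pairing every condition with every other in both
temporal orders yields 30 ordered pairs, one for each of the 30
sentences. The shuffling step, the seed, and the function name are
assumptions made for illustration; the report does not give the actual
presentation sequence.

```python
import random
from itertools import permutations

CONDITIONS = ["H", "L", "A", "B", "C", "D"]  # two anchors, four systems


def parm_pairs(conditions=CONDITIONS, seed=0):
    """Return (first, second) condition pairs, one pair per test sentence."""
    pairs = list(permutations(conditions, 2))  # all ordered pairs, no self-pairs
    random.Random(seed).shuffle(pairs)          # hypothetical randomization
    return pairs


if __name__ == "__main__":
    for sentence, (first, second) in enumerate(parm_pairs(), start=1):
        print(f"{sentence}{first}  {sentence}{second}")
```

Because every ordered pair occurs exactly once, each condition is
heard immediately after each of the other five conditions once, which
is the sense in which context is uniform.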

An additional contextual feature of the original
version of PARM is provided by two "anchors," a high anchor and
a low anchor, each of which is heard preceding (and following)
each system under evaluation on the same number of occasions.
The selection of anchors, particularly the low anchor, was a
matter of special concern. It was considered important, first,
that the anchors represent more extreme levels of acceptability
than those likely to be encountered in any system-condition
subjected to evaluation, and secondly, that neither anchor be
uniquely distinguished by one or more perceptual qualities
characteristic of a particular type of system-condition or form
of speech degradation. While the case of the high anchor
presented no particular problem in this connection, the case of the
low anchor was more complicated. Following semantic differential
investigations (see Section 5 for description of the semantic
differential method) involving several candidates, a low anchor
was obtained by connecting the following system-conditions in
tandem: linear predictive coder (LPC), Longbrake, at 2.4 kbps
with VL BER; HY-2 channel vocoder at 2.4 kbps and CVSD at
9.6 kbps with 5% BER. Gaussian noise was added to give a processed
speech/noise ratio of 26-28 dB, lowpassed at 4 kHz. This anchor
was characterized by an average acceptability rating of
approximately 20 (100 point scale) and, as nearly as possible, a
"perceptually neutral" status.

4.2.3 Scoring PARM Data

4.2.3.1 Standard Procedure - In principle, the scoring of
PARM data is a relatively straightforward procedure. The
indicated figure of merit for each condition is simply the
average of the ratings accorded the condition by the listening crew.
Where more than one speaker is involved, additional scores
consisting of the averages associated with each speaker may also be
obtained. Tests of the significance of intercondition differences
may be accomplished by means of some form of analysis of variance
in the case of appropriately designed experiments. Alternatively,
differences among haphazardly selected conditions may be tested
by means of the Newman-Keuls test or a related type of test. A
specimen presentation of PARM results is provided in Figure 4.1.
Shown in the figure are the average ratings of system-conditions
and anchors for individual listeners and for the crew. Shown in
the lower part of the figure is the difference matrix used in
evaluating the significance of differences with the Newman-Keuls
test (see Winer, 1972).
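
The standard scoring step can be sketched as follows, assuming that
ratings are held as one dictionary per listener mapping condition
labels to that listener's average rating. The data layout, the
function names, and the sample values are hypothetical; the
significance test itself (Newman-Keuls) is not implemented here.

```python
from statistics import mean


def crew_means(ratings):
    """Average rating of each condition across the listening crew.

    ratings: list of dicts, one per listener, {condition: rating}.
    """
    conditions = ratings[0].keys()
    return {c: mean(listener[c] for listener in ratings) for c in conditions}


def difference_matrix(means):
    """Pairwise differences of crew means, conditions ordered low to high."""
    ordered = sorted(means, key=means.get)
    matrix = [[means[b] - means[a] for b in ordered] for a in ordered]
    return ordered, matrix


if __name__ == "__main__":
    crew = [  # made-up ratings, for illustration only
        {"H": 85, "A": 60, "B": 45, "L": 20},
        {"H": 75, "A": 50, "B": 55, "L": 30},
    ]
    order, diffs = difference_matrix(crew_means(crew))
    print(order)
    for row in diffs:
        print(row)
```

Ordering the conditions by mean before differencing gives the
triangular layout that range tests of the Newman-Keuls type work from.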

Fig. 4.1 Specimen Set of PARM Results

4.2.3.2 Special Problems - Ideally, the contributions of
individual differences in subjective origin and scale to the variance
of rating results are small by comparison with the contributions
of systematic factors. With relatively large listening crews
(30 or so listeners), this situation may prevail. However, the
economics of routine system evaluation make it desirable to
minimize the crew size requirement. Experimental evaluation of
listener differences in adaptation level, with commensurate
adjustment of individual listener data for differences in
subjective origin, offered one means to this end.

An individual's ratings of the high and low anchors,
common to all PARM sets, provided the basis for evaluating AL
differences. To the extent that a listener is atypically
lenient in his ratings of both anchors, it is a reasonable
hypothesis that he is likewise atypically lenient in his ratings
of the experimental systems or conditions being evaluated, i.e., that
his subjective origin, or AL, is atypically low. To the extent
that his ratings of the high and low anchors deviate in opposite
directions from the respective normative values for the two
anchors, it is appropriate to hypothesize that his subjective
scale is atypically expanded or constricted, depending on the
manner of deviation. His ratings of the anchors can thus provide
a basis for "correcting" his responses to the systems under
evaluation.


It is convenient in the above connection to represent
the response of the typical or ideal listener to system-condition
i in terms of an equation of the form

    $$ R_i = A + BX = 50 + B(R_i - 50), $$

where R_i is the average or ideal rating of system-condition i,
A is the ideal listener's subjective origin, B is a slope or
scale factor (which is "1" by definition in the case of the ideal
listener), and X is the perceived difference between the
system-condition involved and the ideal subjective origin. To the
extent that the response of a given listener, R_ij, differs from
that of the ideal listener, R_i, such differences may be
attributed to individual variation with respect to subjective origin,
A, and slope or scale factor, B.

Given that perfectly reliable means were available
for determining individual subjective origins and slope factors,
the response of an individual listener, R_ij, could be transformed
to its ideal equivalent by appropriate scale and origin
adjustments, i.e.,

    $$ R_i = A_j - (A_j - A) + \frac{1}{B_j}\,(R_{ij} - A_j). $$

What remains to be determined is a means of estimating A_j and
B_j. It was hypothesized that the individual's subjective origin
deviates from the norm if the average of the ratings he assigns
to the two anchors deviates from the ideal of 50. It was
hypothesized that his subjective scale deviates from the norm
if the difference between his average ratings of the high and
low anchors deviates from 58, a historical average for Dynastat
crews.


The first of these hypotheses was tested by examining
the correlation between A_o and A_s. Here, A is the average of
many ratings made by an individual listener: A_o is the average
of the ratings given by a listener to the two anchors
(historically, 50) and A_s is the average of the ratings given by the
same listener to the four system-conditions represented in a
particular PARM. Over the course of a succession of such tests,
the median coefficient of correlation (in this instance, also
the regression coefficient) was .70.* The implication of this
finding is that individual differences in A_o do reflect
individual differences in adaptation level, but provide less
than perfectly reliable indications of such differences. Thus
the most appropriate correction for individual differences in
subjective origin is something less than the difference between
an obtained individual value of A_o and the ideal or normative
value of 50. Specifically, the indicated correction of an
individual's rating of system-conditions is, on this basis,
.70(A_o - 50). Given, for example, A_o = 60, the best estimate of
the individual's "true" subjective origin is 57 [i.e.,
.70(60 - 50) + 50]; the indicated adjustment of his ratings of
individual system-conditions is a uniform reduction of 7 points.

* This assumes equal variances for average system rating and
  average anchor rating, which condition prevailed during the
  major part of this investigation. During the later stages
  of the investigation, the variance of anchor ratings decreased
  somewhat due to ill-conceived instructions given the listeners
  concerning "typical ratings" for the two anchors.
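
The origin correction just described amounts to a regression of the
listener's anchor average toward the normative value of 50. A minimal
sketch follows; the function and constant names are hypothetical,
while the coefficient of .70 and the norm of 50 are taken from the
text.

```python
NORMATIVE_ORIGIN = 50.0   # ideal average rating of the two anchors
ORIGIN_WEIGHT = 0.70      # median regression coefficient from the text


def estimated_origin(anchor_mean: float) -> float:
    """Best estimate of a listener's "true" subjective origin."""
    return NORMATIVE_ORIGIN + ORIGIN_WEIGHT * (anchor_mean - NORMATIVE_ORIGIN)


def origin_adjusted(rating: float, anchor_mean: float) -> float:
    """Rating of a system-condition after the uniform origin correction."""
    return rating - ORIGIN_WEIGHT * (anchor_mean - NORMATIVE_ORIGIN)


if __name__ == "__main__":
    print(estimated_origin(60.0), origin_adjusted(63.0, 60.0))
```

For an anchor average of 60 the estimated origin is 57 and every
system rating is lowered by 7 points, as in the example in the text.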

To test the hypothesis that variations in subjective
scale contribute significantly to the variance of PARM ratings,
the differences between each individual's ratings of the high
and low anchors were correlated with the standard deviation of
his ratings of the four system-conditions involved in each PARM.
(The greater a listener's standard deviation, the finer his
subjective scale and the greater his slope relative to the
typical or normative listener.) Computed on large samples (16-20)
of listeners on a number of PARMs, the median coefficient of
correlation was found to be .30. From these results it was concluded
that interanchor rating differences reflect individual differences
in subjective scale and can thus be used as a basis for a scale
factor correction.

Given that the normative interanchor rating difference is
58, a listener who has an interanchor difference of 68 has a
finer subjective scale (steeper slope) than the average. If
interanchor rating difference were a perfectly reliable indicant
of an individual's subjective scale, transformation of scale
would be accomplished simply by

    $$ \frac{58}{AD_o}\,(R_o - A_t), $$

where AD_o is the observed anchor difference for a single
individual, R_o is his response to a given condition, and A_t is his
true subjective origin. In fact, an observed deviant AD_o warrants
an estimate that the individual's subjective scale is increased
by .30(AD_o - 58); that is, that his "true" interanchor difference
(AD_t) is 58 + .30(AD_o - 58). The appropriate scale adjustment
factor thus becomes

    $$ \frac{58}{AD_t} = \frac{58}{58 + .30\,(AD_o - 58)}. $$
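
The scale adjustment factor can be computed directly from a listener's
two anchor ratings. The sketch below is illustrative only: the
function and constant names are hypothetical, while the normative
difference of 58 and the coefficient of .30 are taken from the text.

```python
NORMATIVE_DIFF = 58.0   # historical average interanchor difference
SCALE_WEIGHT = 0.30     # median correlation reported in the text


def estimated_true_diff(observed_diff: float) -> float:
    """Regressed estimate of a listener's "true" interanchor difference."""
    return NORMATIVE_DIFF + SCALE_WEIGHT * (observed_diff - NORMATIVE_DIFF)


def scale_factor(high_anchor: float, low_anchor: float) -> float:
    """Scale adjustment factor implied by one listener's anchor ratings."""
    observed_diff = high_anchor - low_anchor
    return NORMATIVE_DIFF / estimated_true_diff(observed_diff)


if __name__ == "__main__":
    print(round(scale_factor(89, 41), 3))
```

A listener whose anchors are 68 points apart receives a factor below 1
(his ratings are compressed toward his origin), while one whose
anchors are 48 points apart receives a factor of 58/55, slightly
above 1.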

On the basis of these findings the following equation was
developed as an interim means of correcting rating data for individual
differences in subjective origin and scale:

    $$ R_i = A_o - .70(A_o - 50) + \frac{58}{58 + .30(AD_o - 58)} \Big[ R_o - \big( 50 + .70(A_o - 50) \big) \Big], $$

where R_i is the estimated rating of an ideal listener, A_o is the
observed average rating of the two anchors by a given listener,
AD_o is the observed difference in ratings of the two anchors,
and R_o is the observed or actual rating of a condition by a
given listener.

If, for example, an individual listener rates the
high anchor 89, the low anchor 41, and a given system-condition
63, his adjusted rating of the system-condition, R_i, is
calculated as:

    $$ R_i = 65 - .70(65 - 50) + \frac{58}{58 + .30(48 - 58)} \Big[ 63 - \big( 50 + .70(65 - 50) \big) \Big] $$
    $$ \phantom{R_i} = 65 - 10.5 + \frac{58}{55}\,(63 - 60.5) $$
    $$ \phantom{R_i} \approx 54.5 + 2.5 = 57.0 $$


Application of the above equation serves two distinct but
related functions. On one hand, it serves to reduce the effects
of sampling errors which may express themselves as crew
differences, particularly in cases involving small listening crews.
On the other hand, it reduces the listener component of variance
within crews. This effectively increases the sensitivity or
power of tests for significance of differences between systems
rated in separate PARMs, given the assumption of independent
listener samples. Although scale adjustments may operate to
increase the sensitivity of significance tests conducted on
systems evaluated in the same PARM, origin adjustments will have no
effect on the sensitivity of such tests.

Further research on the issue of individual differences
in subjective origin and scale is clearly called for.

The above adjustments served effectively, however, for the
immediate purposes of the Narrow Band Consortium.

The efficacy of adjustments for subjective scale
and origin differences was evident on many occasions over an
extended period, in particular as such adjustments substantially
increased the replicability of test results, both within and
across crews. However, after six months or so, during which
the listening crews had intensive exposure to PARM on a regular
basis, various discrepancies in PARM results began to emerge.
In particular, individual system-conditions which were subjected
to repeated evaluation in varying context occasionally received
inconsistent acceptability ratings. The possibility that such
inconsistencies arose from contextual differences was explored
but rejected. No malfunction of the playback equipment could
be detected.

Although it might have been expected that the above
adjustments for origin and scale shifts would offset the effects
of long-term adaptation level drifts, a complicating factor
emerged: many subjects evidently learned to identify the anchors
and to rate them in an extremely consistent manner. This tendency
was undoubtedly enhanced by the fact that early in the project the
subjects were apprised of the "typical ratings" for the two anchors.
This attempt to "homogenize" the listening crews proved to be
ill-advised. The tendency of a number of listeners to assign ratings
of 80 and 20 to the high and low anchors, respectively, regardless
of their actual subjective scales and origins, significantly reduced
the sensitivity of anchor ratings to individual differences in
subjective origin and scale. Adjustments based on ratings of the
anchors appeared to become less and less efficacious with the
passing of time.

In a further attempt to find the reasons for the
observed discrepancies in PARM results, a number of PARM sets
evaluated over the course of the preceding six months were
reevaluated one or more times. With rare exceptions,
acceptability ratings of individual systems were lower on reevaluation
than on initial evaluation. Moreover, the size of the drop
appeared to be related to the dates on which the evaluations
took place. From these and other data it was possible to
define a trend which indicated, for example, that a
system-condition evaluated in late September would receive an average
acceptability rating nearly nine points lower than when
previously tested in June.

To verify the above trend, the multiple correlation
between PARM ratings and Diagnostic Rhyme Test diagnostic scores
was computed for various classes of system-conditions. Multiple
correlations ranging from .60 to .70 were obtained, depending
upon the class of system-conditions involved. Examination of
the differences between actual PARM ratings and predicted
ratings revealed a pronounced trend as a function of the date
of the PARM evaluation. Actual PARM ratings generally exceeded
predicted ratings for system-conditions evaluated early in the
six-month period, but consistently fell short of predictions
during the later stages of the period. The trend of these
deviations as a function of PARM test date was quite consistent with
the trend derived from PARM test-retest comparisons. Further
confirmation of the trend was provided by test-retest results
involving single system-conditions in different contexts.

Figure 4.2 represents a somewhat arbitrary combination
of these various estimates of the trend, greatest weight being
given to test-retest results for complete PARM sets. Whatever its
validity, the cause of the trend is yet to be determined. Its value
for purposes of future PARM evaluations is open to question. In
any case, one lesson learned from this experience is that periodic
checks for long-term "adaptation level drift" should become a
standard aspect of PARM procedures. As will be shown elsewhere,
listener differences in subjective origin and scale tend to be
extremely stable over the course of a single PARM, over a daily
rating session, and over somewhat more extended intervals of
time. However, the possibility of longer-term trends must be
recognized and provided for in future PARM projects.

It should perhaps be remarked that long-term adaptation
level (AL) drift became evident only after the crews involved had
been exposed to PARM for several months, during which period they
were subjected to an extremely heavy PARM schedule. It is possible
that long-term AL drift will prove to be less of a complicating
factor with less arduous testing regimens, but resolution of
this issue must await the results of further research.

4.2.4 Reliability of PARM - A test is said to be reliable
to the extent that it yields replicable or self-consistent
results. The reliability of a test is a measure of freedom
from error and, ultimately, of resolving or discriminating
power. Reliability varies in a predictable manner with test
length in particular, and with redundancy in general. Since
test length is a matter of some economic consequence, detailed
examination of the reliability of PARM is appropriately a
matter of major concern.

Efficiency in the use of testing time and resources
depends heavily on the manner in which redundancy is utilized
in a test. Ideally, it is allocated among the various test
parameters in such a way as to equalize the sampling errors
associated with these parameters. If, for example, the sampling
error associated with speakers were found to be extremely
pronounced in a test of system performance, the most direct
remedy would be an increase in the sample of speakers and
(assuming constraints on the total amount of data collected
per speaker) a decrease in some other dimension of redundancy.
More comprehensive treatment of the relevant principles of
experimental design is not feasible here, but the general
principle is that redundancy be allocated in proportion to the
intrinsic variability (variance) associated with a test
parameter.


PARM is potentially susceptible to a diversity of
extraneous effects, both systematic and random. Recognition
of this fact is implicit in various symmetries that
characterize the design of PARM. The issue to be resolved at this
point, however, is whether PARM, as initially designed, makes
optimal use of its redundancy. Described below is a series of
investigations which bear on this issue and, more generally,
on the reliability of PARM results. Because PARM test materials
are impractical to assemble without the special facilities
available at DCEC, it was necessary to draw the data for these studies
primarily from operational system evaluations performed under
the terms of Contract No. DCA100-75-C-0034. Inevitably, this
served to impose various constraints on the design of
validation experiments, but did permit reasonably rigorous treatment
of the major issues. Except where noted otherwise, data used
for these investigations were yielded by operational tests
identified as 2M, 7M, 8M, and 32M. Among them they provided a
fairly representative sample of state-of-the-art digital voice
systems. All were 6-speaker (male) tests, each involving four
system-conditions and two anchors.

4.2.4.1 Components of PARM Variance - The design of PARM is
such that PARM results are amenable to analysis of variance in
which the testable effects are (among others) listeners, speakers,
trials, and system-conditions. It is thus possible to estimate
the contributions of all of these effects to the variance of
PARM results. The principle employed in deriving such estimates
is embodied in the relation:

$$MS_E = \sigma_e^2 + t\,\sigma_E^2$$

where $MS_E$ is the mean square for an effect or treatment (e.g.,
listeners), $\sigma_E^2$ is an unbiased estimate of the true variance
associated with the effect, $\sigma_e^2$ is the random component, and t
is the number of occasions (e.g., the number of ratings made by a
listener) on which each state of E is represented (not to be
confused with the degrees of freedom associated with the effect).
Thus,

$$\sigma_E^2 = \frac{MS_E - \sigma_e^2}{t}$$

is the estimated contribution of E to the variance of a single
observation. In turn, the estimated variance of an average of
t observations is given by $t\,\sigma_E^2$. Where E is an undesirable or
extraneous component, it is clearly desirable to minimize t.
If, for example, $\sigma_E^2$ were the component of variance attributable
to speakers in an acceptability rating experiment, increasing t
would serve to increase the contribution of speaker sampling
error to the test results. A reduction of t, with a commensurate
increase in the number of speakers, would serve to decrease the
speaker effect and, generally, to increase the reliability of
the test without increasing its length.
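
The estimate described above can be sketched in a few lines of Python, taking the relation to be $MS_E = \sigma_e^2 + t\,\sigma_E^2$. The function name and the numerical values are hypothetical (they are not drawn from Table 4.1), and flooring a negative estimate at zero is a common convention assumed here rather than something stated in the report.

```python
def variance_component(ms_effect, ms_error, t):
    """Estimate sigma_E^2 from MS_E = sigma_e^2 + t * sigma_E^2.

    ms_effect : mean square for the effect E (e.g., listeners)
    ms_error  : mean square taken as the estimate of sigma_e^2
    t         : number of observations on which each state of E is represented
    A negative estimate is set to zero (assumed convention).
    """
    if t <= 0:
        raise ValueError("t must be positive")
    return max(0.0, (ms_effect - ms_error) / t)


# Hypothetical mean squares, for illustration only.
sigma2_E = variance_component(ms_effect=500.0, ms_error=100.0, t=40)
contribution_to_mean_square = 40 * sigma2_E  # the t * sigma_E^2 term
```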

Examination of data from four representative PARM
sets yielded the results presented in Table 4.1. Shown for each
PARM set are estimates of the contributions of the indicated
effects to the total variance of listener ratings of four
system-conditions. Specifically, $t\,\sigma_E^2$ is an estimate of the variance
contributed by an indicated effect to an average PARM rating
for the case of PARM as presently constituted. Estimates of
$\sigma_E^2$, the contribution of each effect to the variance of a single
unit of observation, are also shown to indicate the intrinsic
variability associated with each effect. Column t shows the
number of unit observations, or "trials," involving each level
or case of the effect (e.g., each listener) involved. "Error
pool" identifies the effects for which sums of squares were
pooled to obtain an estimate of the error variance in each
instance. For purposes of this analysis, it is assumed that
all second and higher order interactions are insignificant,
a rather strong but necessary assumption, considering that all
the involved effects are fixed rather than random effects.

Although the results vary somewhat from PARM set
to PARM set, some important consistencies are evident. Compared
with listeners and listener x systems, all of the other
extraneous effects are of negligible consequence. Much of the
inherent redundancy of PARM thus appears not to be used to best
advantage.


In particular, the results bearing on the importance
of context are consistent with earlier findings (Voiers, 1974)
that the immediately prior condition has little effect on the
PARM rating of a given condition. The effect of speakers appears
to be negligible, suggesting that listeners are not generally
biased in their ratings by the quality of the speaker's voice.
There is some indication of interaction between speakers and
systems, suggesting that the various systems are not equally
receptive to all voices. However, the magnitude of this
interaction is not substantially greater than the random effect, as
estimated by the interaction, listeners x speakers x context x
systems.

Taken together, these results suggest that the
reliability of PARM could be substantially increased, at no cost in
total amount of data collected, by increasing the number of
listeners and proportionally decreasing the amount of data
collected from each listener, e.g., by dispensing with the
requirement of "all possible pairings of system-conditions."
(Alternatively, the length and cost of PARM could be reduced at no
cost in reliability.) However, further research on this issue
is in order before instituting extensive changes in the design
of PARM.
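
The trade-off suggested here can be made concrete with a small Python sketch. It assumes a deliberately simplified model in which a rating is the sum of a listener effect and independent random error, so that the variance of a crew mean is the listener component divided by the number of listeners plus the error component divided by the total number of ratings. The variance components and crew sizes below are hypothetical, not values from Table 4.1, and interactions are ignored.

```python
def crew_mean_variance(var_listener, var_error, n_listeners, ratings_per_listener):
    """Variance of a crew-average rating under a one-way random model.

    Simplifying assumption: rating = listener effect + independent error;
    all interaction terms are ignored.
    """
    total_ratings = n_listeners * ratings_per_listener
    return var_listener / n_listeners + var_error / total_ratings


# Same total amount of data (960 ratings), allocated two ways.
few_listeners = crew_mean_variance(60.0, 120.0, n_listeners=20, ratings_per_listener=48)
many_listeners = crew_mean_variance(60.0, 120.0, n_listeners=40, ratings_per_listener=24)
```

With the total held fixed, the error term is unchanged and only the listener term shrinks, which is the sense in which additional listeners are a more efficient use of redundancy than additional ratings per listener under this model.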

4.2.4.2 Split-half Reliability of PARM - Assuming that
short-term contextual factors have virtually no impact on PARM ratings,
as is indicated in Table 4.1, the second half of a PARM
effectively replicates the first. The question then becomes one of
whether such replication is in fact necessary. To the extent
that the two halves yield equivalent results, a negative answer
to this question is warranted. Two aspects of first-half -
second-half equivalence are of interest: first, whether crew
average ratings undergo systematic changes from the first half
to the second half and, second, whether individual listeners
maintain their relative positions in terms of the ratings they
accord the system-conditions.

PARM sets 2M, 7M, 8M, and 32M were used to resolve
the above issues. Results of the analyses conducted for this
purpose are presented in Table 4.2. Shown in the table are the
average ratings given to four system-conditions by a crew of
20 listeners during the first half of each PARM and during the
second half. From these results it appears that little or no
rating drift occurs over the course of a PARM. In three of the
four cases first-half - second-half differences were virtually
non-existent. In the fourth case a larger, but statistically
insignificant, difference was obtained. Further tests
involving additional PARM sets failed to provide any more evidence of
rating drift from first to second half.

TABLE 4.2  Split-half Reliability of Listener Ratings
(N = 20)

                 Mean System Rating
PARM    First Half   Second Half    Diff      "t"*       r_ii   r (crew of 8)
2M         56.1         56.0         0.1       0.0        .79       .97
7M         49.0         49.0        -0.8       1.0        .89       .98
8M         51.6         51.6         0.0       0.0        .89       .98
32M        51.4         53.4        -2.0   [illegible]    .82       .97
Mean       52.0         52.5          .5

*For 19 df, P < .05 for "t" ≥ 2.09

Also shown in Table 4.2 are split-half coefficients
of reliability for the four cases. Specifically, these are
coefficients of correlation between the individual's average
rating of the four systems for the first and second halves of
each test. Though far from perfect, these correlations indicate
a generally high degree of individual consistency from one half
of a PARM test to the next. These results also bear on the
problem of crew stability from one half to the next. Application
of the Spearman-Brown Prophecy Formula (see Guilford, 1954, pp.
353-354) to these results provides the basis for estimating the
correlation that would prevail between crew average ratings for
the first and second halves of a PARM. The final column in
Table 4.2 shows that for a crew of eight listeners, virtually
perfect predictions of average (four) system ratings from one
half of a PARM to the other could be achieved.
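
The Spearman-Brown step can be sketched in Python as follows. The formula is the standard one; that the final column of Table 4.2 was obtained by applying it with k = 8 to the individual-listener coefficients is an assumption, although it does reproduce the tabled values to two decimals. The function name is illustrative.

```python
def spearman_brown(r, k):
    """Predicted reliability when a measure is lengthened k-fold.

    r : reliability (correlation) for a single unit, here one listener
    k : number of units averaged, here the crew size
    """
    return k * r / (1.0 + (k - 1.0) * r)


# Individual split-half coefficients from Table 4.2, projected to a crew of 8.
individual_r = {"2M": 0.79, "7M": 0.89, "8M": 0.89, "32M": 0.82}
crew_r = {name: round(spearman_brown(r, 8), 2) for name, r in individual_r.items()}
```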

The most important conclusion to be drawn from these
results is simply that ALs for listeners and, in turn, crews
remain exceptionally stable over the course of a PARM. Data
obtained from the second half of a PARM provide little
additional information.

4.2.4.3 Effects of Utterance Position - Another redundant
aspect of PARM stems from the fact that each system-condition
is evaluated equally in the "first utterance" position and in
the "second utterance" position. A comparison of the results
obtained under these two conditions is thus of interest. This
comparison is provided in Table 4.3. A significant systematic
difference between first utterance and second utterance ratings
is evident in three out of four cases. Other things equal,
listeners evidently tend to rate systems more favorably when
they are presented via the second utterance than when presented
via the first. The reasons for this difference are not clear, but
a reasonable hypothesis is that the greater familiarity of a
sentence on second utterance enhances its intelligibility and,
in turn, its overall acceptability. (There are subsequent
indications that inter-utterance rating differences decrease as
the listener gains greater familiarity with the corpus of test
sentences.) But while listeners tended, systematically, to rate
systems more favorably in the second utterance position than in
the first, there is high correlation between listener ratings
in the two positions. At the listener level and the crew level,
second utterance ratings are highly predictable from first
utterance ratings. Thus, little additional information is provided
by the second utterance data.

TABLE 4.3  Interutterance Differences and Correlation
(N = 20)

                    Mean System Ratings
PARM    First Utterance   Second Utterance    Diff    "t"*    r_ii   r (crew)
2M           55.9              56.3           - .4    1.18    .94     1.00
7M           48.6              50.1           -1.5    4.10    .97     1.00
8M           51.2              52.0           - .8    2.96    .97     1.00
32M          51.9              52.9           -1.0    3.50    .98     1.00
Mean         51.9              52.8           - .9

*For 19 df, P < .01 for "t" ≥ 2.86
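
The "t" values in Table 4.3 carry 19 degrees of freedom for 20 listeners, which is what a paired comparison of each listener's first utterance and second utterance averages would give. A minimal Python sketch of such a paired test follows. The report does not state how the statistic was computed, so the paired form is an assumption, and the four-listener data set is invented purely to exercise the function.

```python
import math


def paired_t(first, second):
    """Paired-comparison t statistic with len(first) - 1 degrees of freedom."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples of at least 2 values")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0:
        raise ValueError("differences have zero variance")
    return mean_diff / math.sqrt(var_diff / n)


# Invented per-listener averages (not report data).
first_utterance = [50.0, 52.0, 48.0, 55.0]
second_utterance = [51.0, 54.0, 49.0, 57.0]
t_value = paired_t(first_utterance, second_utterance)
```

A negative value indicates higher ratings in the second utterance position, matching the sign convention of the "Diff" column.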

4.2.4.4 Intercondition Effects - In Section 4.2.4.3 it was
shown that listener differences in first utterance ratings were
highly correlated with listener differences in second utterance
ratings. There is, however, an additional issue relating to
interutterance dependencies which merits examination. This is
the issue of the general effects of one stimulus condition on
the rating of the immediately following condition. Adaptation
level theory would lead to the prediction, other things equal,
of a negative correlation between successive ratings by an
individual listener. A highly rated initial condition should
tend to depress the rating given the succeeding condition. A
low quality initial condition should tend to enhance the
perceived quality of the condition which follows it. Earlier
research on this general issue has led to the conclusion that
such effects are of generally negligible magnitude. However, a
further investigation of the issue seemed warranted, and was
accordingly undertaken.

Data from four PARM sets (2M, 7M, 8M, 32M)
were used for this purpose. These data consisted of second
utterance ratings for which the preceding conditions were one or
the other of the two anchors, effectively providing a "worst case"
test of adaptation level stability. The test involved an analysis
of variance with factorial design in which the main effects were
system-condition, preceding anchor, and listener. A separate
analysis was performed for each of the four PARM sets (each
set involved different system-conditions). In all cases,
average ratings were higher when the preceding condition was the
low anchor than when it was the high anchor. However, the
magnitude of this effect and of the interaction of systems and
context, though statistically significant (Table 4.4) in three
instances, was generally quite small. Moreover, even smaller
effects are to be expected when less extreme preceding
conditions are involved. An example (PARM set 7) is provided
in Fig. 4.3, where the independent variable is the average first
utterance rating of a preceding condition (system or anchor),
the dependent variable is the average rating of the following
condition, and the parameter is the identity of the following
condition. In no case does the average rating of the following
condition vary substantially as a function of the average rating
of the preceding condition, although the effect is statistically
significant under extreme circumstances.

These results are
consistent with those of Parducci (1964) and Voiers (1974), to
the effect that the extreme stimulus conditions experienced in
an experimental situation do exert a pronounced effect on the
subject's response to other stimuli, and that this effect tends
to remain fairly constant throughout the course of a laboratory
session. Subsequent exposures to extreme stimuli are not
accompanied by substantial adaptation level changes. As Parducci
(1964) has observed:

"The  relative  permanence  of  this  end-anchoring 
in  simple  laboratory  situations  may  tend  to 
obscure  trial-to-trial  changes  in  AL.  It  is 
as  though  the  two  extreme  stimuli  were  constantly 
present  as  standards  against  which  each  of  the 
successive  stimuli  are  compared."

TABLE 4.4  Effects of Immediate Context (preceding condition)
on PARM Ratings

                                  Degrees of            F-Ratios for PARM Sets*
Source                             Freedom    Error     2M      7M      8M     32M
1. SYSTEM                             3       (5.)     68.3    18.2    14.5     5.9
2. CONTEXT (preceding anchor)         1       (6.)      8.2     4.1     6.4    12.4
3. LISTENERS                         19        --
4. SYSTEM x CONTEXT                   3       (7.)      1.4      .7     5.7     5.9
5. SYSTEM x LISTENERS                57        --
6. CONTEXT x LISTENERS               19        --
7. SYSTEM x CONTEXT x LISTENERS      57        --
TOTAL                               159

Mean rating difference ("low anchor
preceding" minus "high anchor
preceding")                                             1.8      .7     1.3     2.5

*For 3 and 57 degrees of freedom, P < .05 for F ≥ 2.76 and
P < .01 for F ≥ 4.13; for 1 and 19 degrees of freedom,
P < .05 for F ≥ 4.38 and P < .01 for F ≥ 8.18.

Seen  in  the  above  light,  the  practice  of  pairing
all  systems  would  appear  to  constitute  a fairly  inefficient 
use  of  resources.  It  would  seem  necessary,  at  most,  to  insure 
that  all  systems  under  evaluation  were  preceded  on  an  equal 
number  of  occasions  by  each  of  the  two  anchors. 
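
The significance statements drawn from Table 4.4 can be checked mechanically against the critical values given in the table's footnote. The short Python sketch below does so for the context effect (1 and 19 df) and the system x context interaction (3 and 57 df). The F-ratios and critical values are transcribed from the table; the function and variable names are illustrative.

```python
# F-ratios transcribed from Table 4.4.
CONTEXT_F = {"2M": 8.2, "7M": 4.1, "8M": 6.4, "32M": 12.4}
SYSTEM_X_CONTEXT_F = {"2M": 1.4, "7M": 0.7, "8M": 5.7, "32M": 5.9}

# Critical values at the .05 level, from the table footnote.
F_CRIT_1_19 = 4.38
F_CRIT_3_57 = 2.76


def significant_sets(f_ratios, critical_value):
    """Return the PARM sets whose F-ratio reaches the critical value."""
    return [name for name, f in f_ratios.items() if f >= critical_value]


context_significant = significant_sets(CONTEXT_F, F_CRIT_1_19)
interaction_significant = significant_sets(SYSTEM_X_CONTEXT_F, F_CRIT_3_57)
```

On these figures the context effect reaches the .05 level in three of the four sets, while the accompanying mean rating differences are all 2.5 points or less.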

4.2.4.5 Inter-PARM Reliability - From the results of the
foregoing analyses it can be concluded that individual and crew
adaptation levels, as measured by average system ratings,
remain quite stable over the course of a PARM testing session.
Intraindividual variation in PARM ratings is either negligible
or adequately controlled by the design of PARM. Remaining to be
answered are questions concerning listener and crew stability
over longer periods of time. To resolve this issue, a crew of
20 listeners was subjected to two administrations of a
representative PARM set (335A, 3 male speakers) during the same testing
session. The first of these administrations was at the beginning
of a routine 4½-hour testing session; the second, near the end.
The crew participated in various other routine tests during the
intervening period. Table 4.5 shows the average rating received
by the four system-conditions and two anchors under each
administration.


Because of the possibility that ratings of the two
anchors were subject to the extraneous influences discussed
earlier, the two administrations were compared using data for
the system-conditions only. A test for the significance of
mean differences yielded a "t" of 0.95, which does not approach
statistical significance. The coefficient of correlation
between individual listeners' mean system-condition ratings on
the two administrations was .90. When the Spearman-Brown
formula is applied to estimate the correlation to be expected
between crew means on repeated administration, this coefficient
becomes .99 for the case of an 8-member crew. The stability of
PARM results over the course of a testing session appears,
therefore, to be extremely high.

TABLE 4.5  Intrasession Stability of PARM Results
(N = 20)

                          First            Second
Condition             Administration   Administration    Diff    "t"     r    r (crew of 8)
High Anchor                80.8             80.8          0.0
D                          54.3             55.0         -0.7
A                          42.0             41.8         -0.2
C                          41.7             43.4          1.7
B                          39.7             39.6         -0.1
Low Anchor                 20.9             19.9         -1.0

MEAN (All conditions)      46.6             46.7         -0.1
MEAN (Systems only)        44.4             44.9         -0.5    .95    .90       .99

For 19 df, P < .05 for "t" ≥ 2.09

4.2.4.6 Effects of Instruction - In view of the dramatic
long-term changes in listener performance that occurred over
the course of this project, it was of some interest to know
the effects of instructions upon listener behavior in the PARM
situation, particularly as the instructions received by
individual listeners (and/or their comprehension of these
instructions) varied somewhat over the period of time involved.

Accordingly, an investigation was undertaken in which an attempt
was made to evaluate the extremes to which listener performance
might reasonably be affected by instructions. The speech
materials used for this investigation were provided by PARM sets
180 and 181, both of which were subjected to a fixed amount of
intermodulation distortion before presentation to the listeners.
(This last feature is not relevant in the present context, having
been introduced for purposes of another experiment.)

Two crews were employed. One crew was administered
PARM set 180 on two occasions, being instructed on the first
occasion to "rate as leniently as you conceivably ever have
during the course of your experience with PARM." Following
a 30-minute break, this crew was again administered PARM set
180, being instructed on this occasion to "rate as stringently
as you ever conceivably have during your experience with PARM."

The second crew was administered PARM set 181 in a
similar fashion, except that the time order of the two
instructional conditions was reversed from that of the previous case.

The results of this experiment are summarized in Table 4.6.
From the table it appears that the instruction given the subject
can, in the extreme, increase or decrease his effective
adaptation level on the order of six rating points. Although the obtained
correlation between averages for individual raters under the two
conditions was drastically reduced by a single deviant listener
in the case of PARM 180, the true correlation appears to be quite
high: individuals and crews respond in a relatively uniform
manner to instructions regarding the rating "set" they should
adopt.

TABLE 4.6  Effect of Instructions on PARM Ratings

                         Mean System Rating
                     "Stringent"    "Lenient"
                      Condition     Condition     Diff.    "t"      r
PARM 180 (N = 9)        30.4          42.6        12.2     4.54    .06
PARM 181 (N = 7)        29.2          41.1        11.9     8.47    .92

For 8 df, P < .01 for t ≥ 3.36; with 6 df, P < .01 for t ≥ 3.71


In view of the fact that differences in the
instructions given subjects at different times in the course of this
project never approached the extremes represented here, it seems
highly unlikely that changes in listeners' conceptions of their
task could have accounted to a significant extent for the
long-term adaptation level drift (implied by a 10-point drop in average
ratings) described in Section 4.2.3.2.

4.2.4.7 Evaluation and Control of Listener Differences - From
the various results described in the foregoing sections it is
evident, on one hand, that individual differences in adaptation
level represent the major source of sampling error in PARM
ratings. On the other hand, there is substantial evidence
concerning the stability of individual adaptation level, both over
time and over a diversity of experimental conditions. Taken
together, these results attest further to the feasibility of
"calibrating" listeners and, in turn, of adjusting rating data
to compensate for such differences. The use of high and low
anchor ratings for such purposes was in fact instituted as part
of the standard PARM scoring procedure quite early in the program.
However, the question of whether ratings of anchors provide the
optimal bases for evaluating the prevailing adaptation levels of
individual listeners remained to be determined. Accordingly,
further research on the issue was undertaken using data from
PARM sets 2M, 7M, 8M, and 32M yielded by a crew of 20 listeners.

On the hypothesis that individual adaptation levels
remain stable during the course of a single PARM, individual
differences in ratings of the anchors and system-conditions
should be correlated to some degree. The question then arises
as to how best to detect individual differences in adaptation level.
Factor analysis provides a means of resolving this issue.

For each of the PARM sets, the correlations among
individual listeners' ratings of the two anchors and four
experimental system-conditions were determined. The obtained
correlation matrices were then subjected to a principal axis factor
analysis. The results of these analyses are summarized in
Table 4.7.


Uniformly high positive loadings of anchors and
system-conditions on Factor I serve to identify this factor as
adaptation level or subjective origin. The implication of this
configuration of loadings is that listener differences in
ratings of all conditions are subject to a common influence:
knowledge of an individual's deviance in rating any one
condition thus has value for predicting his deviance in rating any
other condition. These results are consistent with earlier
findings regarding the correlation between average anchor ratings
and average system ratings, but they yield several important
additional insights.

One inference to be drawn from the results in Table
4.7 is that the high and low anchors do not provide the best
possible means of evaluating individual adaptation levels. The
basis of this inference is to be found in the relatively low
Factor I loadings of the anchors in all four cases. The high
loadings of the system-conditions which fall near the midrange
of the acceptability continuum indicate that midrange conditions
are better adapted for purposes of sensing individual differences
in adaptation level. Factor I loadings in the .85 - .95 range
serve, in fact, to suggest that a single "midrange anchor" could
serve quite effectively for purposes of calibrating individual
listeners. Knowing individual ratings of such an anchor would
permit the investigator to account for (and adjust for) something
on the order of 81% ($.90^2$) of the sampling error associated with
individual differences in adaptation level. By contrast, the
optimal combination of high and low anchors would, at best, suffice
to account for approximately 52% ($.50^2 + .52^2$) of this component
of variance.

TABLE 4.7  Factor Structure of PARM Ratings
(N = 20)

FACTOR LOADINGS

                          Factor I                              Factor II
                          PARM Set                              PARM Set
Condition       2M    7M*    8M    32M   Mean        2M       7M*    8M    32M   Mean
High Anchor    .36    .40   .67    .57    .50        .84      .88    .60   .77    .77
System A       .63    .91   .88    .87    .82        .41      .11    .14   .24    .17
System B       .86    .87   .89    .95    .89       -.02     -.16    .12  -.06    .03
System C       .88    .82   .94    .90    .89        .08     -.50   -.09  -.24   -.18
System D       .84    .81   .86    .88    .85   [illegible]  -.49   -.26  -.26   -.29
Low Anchor     .53    .54   .39    .62    .52       -.70     -.70   -.84  -.36   -.64
Percent Trace  .50    .56   .63    .66    .59        .24      .29    .19   .14    .21

* Original factor axes arbitrarily rotated.
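
The two percentages quoted above follow from squaring the Factor I loadings, a squared loading being the proportion of a variable's variance shared with the factor. The Python sketch below reproduces the arithmetic. Summing the two anchor terms mirrors the report's own "at best" calculation and is not offered as a general rule for combining predictors; the function name is illustrative.

```python
def variance_accounted_for(loadings):
    """Sum of squared factor loadings, as a proportion of variance."""
    return sum(loading * loading for loading in loadings)


# A single hypothetical midrange anchor loading .90 on Factor I.
midrange_anchor = round(variance_accounted_for([0.90]), 2)

# Mean Factor I loadings of the high (.50) and low (.52) anchors, Table 4.7.
both_anchors = round(variance_accounted_for([0.50, 0.52]), 2)
```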

An examination of the pattern of loadings on Factor II
reveals this factor to be a subjective scale factor. Specifically,
the high loadings (though of opposite sign) uniformly exhibited by
the high and low anchors indicate that listeners differ in terms
of the subjective scales to which they reference their ratings.
Other things being equal, the listener who tends to be more extreme
in rating at one end of the scale also tends to be more extreme in
rating at the other end. Given no listener differences in
adaptation level, one would thus expect to find a negative
correlation (or factor loadings of opposite sign) between ratings
of the high and low anchors. The pattern of Factor II loadings thus
indicates that a substantial amount of the listener component of
variance in PARM ratings can be attributed to individual
differences in subjective scale and that the interanchor range for
individual listeners can provide a means of controlling this
subcomponent of variance. It should be noted, however, that the
practical benefits of such controls will tend to be rather limited,
except in circumstances involving system-conditions falling near
the extremes of the acceptability continuum.

4.2.4.8 Evaluation and Control of Speaker Factors - As noted
in Section 4.2.4.1, the magnitude of the speaker's contribution
to PARM variance is small compared to the contribution of the
listener. However, its statistical significance was an unresolved
issue. Further analysis of the data from PARM sets 2M, 7M, 8M, and
32M yielded results which bear on this issue. They are presented
in Table 4.8. In two of the four cases the main effect for
speakers is significant at the .01 level. In all four cases
the interaction of speakers and systems is significant. Evidently
systems vary in their receptivity to individual voices. It should
be noted, however, that the sample of speakers involved here was
in no sense a random sample. Rather, it was deliberately selected
to provide representation of extremes with respect to fundamental
frequency. The practical significance of these results is,
therefore, still open to some question. A less rigorous examination
of data from a large number of PARM sets revealed that speaker
variation, either within or between sexes, is rarely of a magnitude
comparable to that associated with listeners or system-conditions.
However, further research on this issue is clearly in order.

4.3  Interim Conclusions and Recommendations for the Use of PARM


From the diversity of experimental results described
in the preceding sections two major principles can be clearly
discerned.

1.  Listener differences account for the major component
    of the extraneous variance of PARM results. By
    comparison the contributions of other systematic
    factors are negligible.

2.  The listener component of variance in PARM test
    results has its origin primarily in stable
    listener differences in subjective origin or
    adaptation level, which differences are eminently
    subject to statistical evaluation and control.


TABLE 4.8  Evaluation of Speaker Contribution to PARM Variance

Given that the means of controlling the listener
factor can be found, PARM can be expected to provide extremely
reliable estimates of system-acceptability for the population
represented by the experimental listener sample. Realization
of this expectation can be facilitated if cognizance is taken
of a number of secondary or corollary principles that have also
emerged from the results of research conducted during the period
of this contract. The more important of these are discussed
below.

4.3.1  Use of Anchors, Probes and Reference Standards - It
is evident from an accumulation of results that the function of
anchors and the function of reference standards in rating
situations are quite different. Reference standards are properly
used to achieve experimental control of extraneous variance in
psychophysical experiments. To this end, the identity and function
of reference standards are normally made explicit to the
experimental subjects, who may or may not be required to evaluate
the standards themselves.


By its mere presence an anchor exerts some degree of
experimental control of adaptation level (AL). Anchors can also be
used to achieve some degree of statistical control of extraneous
variance, in that the subject's response to an anchor may permit
statistical evaluation of, and correction for, intra- and
inter-listener variation in AL. For such controls to be most
effective, however, the listener must be unconstrained in his
response to an anchor, as experience in the present project has
confirmed. In the present case an attempt was made to reduce
individual differences in subjective origin experimentally by
apprising listeners of the historical ranges of the ratings given
the high and low anchors. While this procedure was undoubtedly
efficacious in some respects, subsequent results clearly indicate
that it substantially reduced the value of anchor ratings for
purposes of sensing residual individual differences in subjective
origin.

Following the receipt of information concerning the
historical ratings of the two anchors, some listeners effectively
changed their subjective origins and response scales when
responding to the anchors, but were unable to maintain the same
frame of reference when rating the system-conditions involved.
These findings attest to the validity of the adaptation level
concept, for the listeners evidently continued to rate
system-conditions in relation to stable adaptation levels, even
while artificially changing their modes of response to the anchors.
The value of anchor ratings for detecting AL differences was,
however, greatly reduced under such circumstances.

It is possible that some benefit is to be realized by
identifying the extreme anchors for experimental listeners without
indicating "appropriate" ratings of these anchors. A wealth of
evidence indicates that such procedures will effectively stabilize
the rating behavior of the individual listener. There remains,
however, the problem of stable listener differences in adaptation
level, which differences make acceptability ratings highly
susceptible to listener sampling error.

It will simplify matters somewhat if a terminological
refinement is introduced at this point. Specifically, it
is suggested that the term "anchor" be reserved for extreme
conditions whose primary function is to reduce intraindividual
variation in adaptation level experimentally. The term "probe"
will be reserved for conditions used primarily to sense
interindividual differences in adaptation level, to the end of
permitting retrospective statistical adjustments for such
differences.

Conditions designed primarily to serve the anchoring
function may, in fact, have some value as probes if no
constraints are placed on the listener's responses to these
conditions. However, the various results described above attest
to the superiority of midrange conditions as probes. Whatever
use is made of the extreme anchoring conditions, the inclusion
of one or more midrange probes would thus seem to be highly
desirable in the case of PARM or similar methods of acceptability
evaluation.


In summary, the results available to date indicate
that the reliability of PARM can be significantly enhanced by
the use of two extreme anchors and one or more midrange probes.

4.3.2  Feasibility of Listener Selection as a Means of
Enhancing the Reliability of PARM Results - The contribution of
listener factors to the variance of PARM results has been dealt
with extensively in the preceding sections. The evidence, both
implicit and explicit, leaves little doubt that control of this
factor can significantly enhance the reliability of PARM. Anchors
and probe conditions offer one means of achieving at least partial
control of this factor, but additional means are available. One
is through the astute selection of listeners, the feasibility of
which is attested to by the remarkable degree of stability, over
both the short and long term, that characterizes the performance
of the typical listener.

A series of studies has shown that the residual, or
steady state, adaptation level of relatively unselected listeners
can vary over a range of 20 points on the acceptability continuum.
(The most tolerant listener among Dynastat's crew of 40 listeners
consistently rates systems 20 points higher than the least tolerant
listener on the crew.) Because of the self-consistency of the
typical listener, however, it is possible to select a subsample
of listeners for which individual AL's (as reflected in their
ratings of a "probe PARM-set") have a relatively restricted
range.

The desirability of a standard procedure for preselection
of PARM listeners seems beyond question at this point.
The possibility remains, however, that further refinement of
PARM can be achieved by post-experimental selection, i.e.,
by means of procedures for determining that individual
participants in a test have performed in a consistent fashion, and
that their data have been accurately evaluated. One such
procedure that has been employed with some success involves
comparing the individual listener's actual rating of a system
condition with an expected value derived as follows:


    E_{ij} = A_{\cdot j} + A_{i\cdot} - A_{\cdot\cdot}

where A_{\cdot j} is the average of all listeners' ratings of the
jth condition, A_{i\cdot} is the average of the ith listener's
ratings of all conditions, and A_{\cdot\cdot} is the average of all
listeners' ratings of all conditions.

The standard deviation of the differences between the ith
listener's actual and expected ratings, S.D._i, thus becomes a
measure of the extent of the ith listener's variability with
respect to himself and to the crew as a whole. It can serve
effectively as a criterion for detecting listeners who have lost
their places during the test, whose data have not been accurately
transcribed, or who simply performed in a generally erratic manner
during the test. However, it should be noted that S.D._i is
sensitive to true interactions of systems and listeners. It is also
sensitive to individual differences in subjective scale and must,
therefore, be used with some discretion when applied to data which
have not been adjusted for such differences.* Somewhat arbitrarily,
an S.D._i of greater than 7 has been employed with some
effectiveness as a basis for post-experimental rejection of
listeners in the present project.
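
The checking procedure can be sketched as follows. This is a minimal illustration rather than the program used in the project: the function names and the toy ratings are hypothetical, and the use of a population (divide-by-n) standard deviation is an assumption. Only the expected-value formula and the cutoff of 7 are taken from the text.

```python
from statistics import mean, pstdev


def expected_ratings(ratings):
    """ratings[i][j] is listener i's rating of condition j.
    Returns E[i][j] = condition mean + listener mean - grand mean."""
    n_listeners = len(ratings)
    n_conditions = len(ratings[0])
    listener_means = [mean(row) for row in ratings]
    condition_means = [mean([ratings[i][j] for i in range(n_listeners)])
                       for j in range(n_conditions)]
    grand_mean = mean(listener_means)  # equals the overall mean for a full matrix
    return [[condition_means[j] + listener_means[i] - grand_mean
             for j in range(n_conditions)]
            for i in range(n_listeners)]


def listener_sd(ratings):
    """Standard deviation of each listener's (actual - expected) ratings.
    Population SD is an assumption; the report does not give the divisor."""
    expected = expected_ratings(ratings)
    return [pstdev([r - e for r, e in zip(row, exp_row)])
            for row, exp_row in zip(ratings, expected)]


def flag_listeners(ratings, cutoff=7.0):
    """Indices of listeners whose S.D. exceeds the (somewhat arbitrary) cutoff."""
    return [i for i, sd in enumerate(listener_sd(ratings)) if sd > cutoff]


# Hypothetical crew: five listeners who order the four conditions alike but
# differ in adaptation level, plus one listener who responds erratically.
ratings = [
    [80, 60, 40, 20],
    [85, 65, 45, 25],
    [75, 55, 35, 15],
    [90, 70, 50, 30],
    [70, 50, 30, 10],
    [30, 70, 20, 60],
]
flagged = flag_listeners(ratings)
print(flagged)  # [5]
```

Because the listener mean is removed by the expected value, a constant difference in adaptation level does not by itself inflate a listener's S.D.; only departures from the crew's pattern across conditions do.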

In summary: The reliability of PARM results can be
significantly enhanced by careful selection and calibration of
listening crew members and by the astute use of systematic
procedures for post-experimental rejection of inconsistently
performing listeners.

4.3.3  Role of the Speaker - The relevant data available
during the course of this project do not permit unequivocal
conclusions concerning the importance of the speaker as a factor in
PARM results. It can be said, at least, that speaker factors are
of substantially less consequence in the acceptability rating
situation than in the intelligibility testing situation.
Inasmuch as intelligibility is a correlate of acceptability, it is
possible that speakers affect acceptability measurements primarily
through their effects on intelligibility. Further research will
be needed to resolve this issue. For the present, the use of
multiple speakers is recommended.

4.3.4  Miscellaneous Experimental Considerations - Although
it was reported in Section 4.2.4.5 that listener performance did
not deteriorate or otherwise change to a significant degree over
the course of a 4½-hour listening session, it should be noted that
these results were obtained under more or less ideal conditions.
Listeners participated in a total of only four three-speaker PARMs
during the course of this session. These PARMs were interleaved
with several DRTs, which resulted in a "duty cycle" of
approximately 40%.


* The introduction of this checking procedure antedated
investigations of individual differences in subjective scale.
Subject to the results of further research on such differences,
the checking procedure can be easily modified to remove the
effects of systematic scale differences.

Experience has shown that subject morale and performance
deteriorate significantly if the PARM test load
substantially exceeds the equivalent of five three-speaker PARMs
during a normal 4½-hour session. On one occasion early in the
course of this project a specially selected crew was administered
a total of eight three-speaker PARMs during the course of a
4½-hour session. The reactions of the listeners to this procedure
took the form of one resignation, one refusal to participate
beyond the fifth or sixth PARM, and vociferous complaints from
the remaining crew members. Inspection of the data revealed
excessive "lost places" and general deterioration of performance
beginning with the sixth or so PARM. Clearly, PARM makes
extremely rigorous intellectual and attentional demands on the
listener, and his capacity to maintain a stable level of
discriminative performance is definitely limited. In view of this
consideration the extraneous redundancy of PARM becomes an even
more crucial issue.

In summary, modifications which lessen PARM's demands
on the listener's attentive capacities are clearly desirable. In
the meantime, listener exposure to the original version of PARM
should be limited to the equivalent of five three-speaker PARMs
per 4½-hour session, with or without interleaving of other tests
such as the Diagnostic Rhyme Test. (By contrast with the 25-35%
duty cycle that listeners can tolerate with PARM, a 50-60% duty
cycle is comfortably tolerated in the case of the Diagnostic
Rhyme Test.)

4.4  Predictive Validity of PARM

On the hypothesis that both PARM and QUART provide
valid indications of system acceptability, a high degree of
correlation between the two measures is to be expected. In this
connection it was noted first that the original professional
listener sample used with QUART was, but for a difference in
adaptation level, highly correlated with the target sample in
its perception and evaluation of the sample of laboratory and
system conditions employed. It was noted, further, that a
number of factors undoubtedly operated to reduce the
reliability and validity of the QUART data obtained from the
target sample. Accordingly it was decided that a combination
of data from the two samples would provide a more valid
estimate of the "true" acceptability levels of the sample of
conditions involved. Such a combination provides a superior
criterion for purposes of validating PARM.

Specifically, acceptability ratings of the system
conditions by the original professional listener sample were
transformed to yield a new variable with the same mean and
standard deviation as the distribution of acceptability ratings
by the target sample. The transformed value for each system
condition was then averaged with the average acceptability rating
accorded it by the target sample, and these averages were used as
criteria for testing the predictive validity of PARM.
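
The construction of the composite criterion can be sketched as below. The linear rescaling is the simplest transformation that yields the stated mean and standard deviation and is assumed here; the report does not give the exact computation. The function names and the ratings are hypothetical.

```python
from statistics import mean, pstdev


def match_mean_and_sd(source, target):
    """Linearly rescale `source` so that it has the mean and standard
    deviation of `target` (assumed form of the transformation)."""
    src_mean, src_sd = mean(source), pstdev(source)
    tgt_mean, tgt_sd = mean(target), pstdev(target)
    return [tgt_mean + (x - src_mean) * tgt_sd / src_sd for x in source]


def composite_criterion(professional, target):
    """Average, condition by condition, the rescaled professional-sample
    ratings with the target-sample ratings."""
    rescaled = match_mean_and_sd(professional, target)
    return [(p + t) / 2 for p, t in zip(rescaled, target)]


# Hypothetical mean ratings of four system-conditions by each sample.
professional = [40, 50, 60, 70]
target = [55, 60, 70, 75]

rescaled = match_mean_and_sd(professional, target)
composite = composite_criterion(professional, target)
print([round(c, 1) for c in composite])
```

The rescaling removes the difference in adaptation level and scale between the two samples before they are averaged, so that neither sample dominates the composite.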

During the term of this project, composite criterion
data and PARM data were available for a sample of only 20
system-conditions. However, the results presented in Figure 4.4
leave little doubt as to the fundamental validity of PARM. An
extremely high correlation would have been obtained but for the two
deviant cases (CONUS Median Voice Grade and APC with 5% BER).
In view of the time elapsed between the processing of the PARM
speech test materials and the QUART speech test materials, it
is a tenable hypothesis that the systems involved were not
functioning in the same fashion on both of the occasions in which
they were involved.


[Figure 4.4; axis label: COMPOSITE ACCEPTABILITY CRITERION]

4.5  Recommendations for Future Use of PARM


It is undoubtedly evident from the foregoing
discussion that PARM, as originally conceived, is in need of some
refinement before it can rival such speech evaluation
instruments as the Diagnostic Rhyme Test from the standpoints of
robustness, reliability, and validity. Highly reliable results can
be obtained from the DRT with minimum regard for the selection and
management of the listening crew, but this is not yet the case
with PARM. However, the means of achieving such refinement are
rather clearly indicated by the results of research thus far
performed, and a number of fairly specific recommendations can
be made at this point.

4.5.1  Selection of Listening Crews - For all but the most
preliminary evaluations, a listening crew of 10 or more carefully
selected listeners is recommended. It is recommended that
listeners be selected on the basis of performance on a probe
PARM set, where the criteria for selection are self-consistency
and conformity with previously established norms for selected
system-conditions.

4.5.2  Selection of Speakers - It is recommended that a
minimum of three male speakers, selected by means of a semantic
differential voice rating form (e.g., as used by Voiers, 1964),
be used for routine system evaluation. Alternatively, speaker
selection may be based on data yielded by PARM for a
representative sample of system-conditions.

4.5.3  PARM Format - It is recommended that the inherent
redundancy of PARM be substantially reduced and that other steps
be taken to control intra-PARM listener variation. Specific
steps to these ends should include:

1.  Abandonment of the paired utterance feature.

2.  Reduction in the number of presentations of
    all conditions.

3.  Inclusion of one or more midrange probe
    conditions in all PARMs, with post-experimental
    adjustment of each listener's data on the
    basis of his ratings of the probe conditions.

4.  Increase in the number of system conditions
    included in each PARM set from four to six.
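
The adjustment named in item 3 might be carried out along the following lines. This is only a sketch under stated assumptions: the report does not specify the form of the adjustment, so a simple additive correction is assumed, in which each listener's ratings are shifted by the amount his mean probe rating departs from the crew's mean probe rating. All names and numbers are hypothetical.

```python
from statistics import mean


def adjust_for_probe(system_ratings, probe_ratings):
    """system_ratings[i][j]: listener i's rating of system-condition j.
    probe_ratings[i]: listener i's ratings of the probe condition(s).
    Returns ratings shifted so that every listener's probe mean
    coincides with the crew's probe mean (assumed additive model)."""
    listener_probe = [mean(p) for p in probe_ratings]
    crew_probe = mean(listener_probe)
    return [[r - (lp - crew_probe) for r in row]
            for row, lp in zip(system_ratings, listener_probe)]


# Two hypothetical listeners who agree on the ordering of two systems
# but whose adaptation levels differ by 10 points.
system_ratings = [[60, 40], [70, 50]]
probe_ratings = [[50], [60]]

adjusted = adjust_for_probe(system_ratings, probe_ratings)
print(adjusted)  # [[65, 45], [65, 45]]
```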

4.5.4  Statistical Control of Long-term Adaptation Level
Drift - It is recommended that a standard PARM-set be
periodically administered to PARM crews and that crew deviations
from the normative response to the standard set be used as a
basis for adjusting the data obtained from the crews during
the particular epoch involved.
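
One way such an epoch adjustment could be computed is sketched below. The report recommends the procedure but does not define the correction, so a single additive correction per epoch is assumed: the crew's mean deviation from the norms on the standard set. The names and values are hypothetical.

```python
from statistics import mean


def drift_correction(standard_set_ratings, norms):
    """Mean deviation of the crew's ratings of the standard PARM-set
    from the normative ratings of the same conditions."""
    return mean([r - n for r, n in zip(standard_set_ratings, norms)])


def adjust_epoch(epoch_ratings, correction):
    """Remove the estimated drift from ratings gathered in the same epoch."""
    return [r - correction for r in epoch_ratings]


# Hypothetical norms for three standard conditions and the crew's
# current mean ratings of them.
norms = [80, 55, 30]
standard_now = [84, 58, 35]

correction = drift_correction(standard_now, norms)
adjusted_epoch = adjust_epoch([62, 47], correction)
print(correction, adjusted_epoch)  # 4 [58, 43]
```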

4.6  Overview


In the foregoing sections evidence with regard to
the intrinsic validity and reliability of PARM has been
presented. It is concluded that PARM can provide a highly reliable
and valid measure of system acceptability to the population
represented by the listening crew. Various recommendations have
been made to increase its reliability, validity, and cost
effectiveness. However, the effect of these recommendations is to
dispense with a number of the features that distinguish PARM as
an evaluation method.

Far from least among PARM's contributions to the
technology of acceptability evaluation has been that of
providing the means of determining which control features are
important and which are trivial. Only through the use of such
an instrument as PARM could one make this determination and
confidently dispense with various of the controls which it
originally incorporated. PARM, as initially conceived, has thus
served both as a valuable research tool and as an interim
instrument for practical acceptability evaluation. Now, perhaps, it
should be abandoned in favor of modifications or new methods
which take better advantage of the principles which it has served
to elucidate.

The Quality Acceptability Rating Test (QUART), described
in the next chapter, represents one new method which was developed
and refined largely on the basis of insights gained through
experiments with PARM.


5.0  INVESTIGATION OF THE SEMANTIC DIFFERENTIAL APPROACH
     TO ACCEPTABILITY EVALUATION; DEVELOPMENT OF THE
     QUALITY ACCEPTABILITY RATING TEST


5.1  The Semantic Differential Approach

The semantic differential approach was originally
developed by Osgood (1952) to provide a comprehensive method of
quantifying meaning. It has subsequently found application to
a diversity of problems, the solutions to which require
parsimonious, quantitative characterizations of complex cognitive
processes. Most relevant in the present context is the
usefulness of the method for characterizing the perceptual
correlates of complex physical stimuli, for example, the
perceptually distinctive characteristics of individual voices
(Voiers, 1964), of passive sonar sounds (Solomon, 1958, 1959a,
1959b), and of complex visual forms (Elliott and Tannenbaum, 1963).

The classic semantic differential method involves a
set of rating scales, each of which is defined by an antonymous
pair of adjectives, for example, good:bad, black:white, and
heavy:light. The respondent's task is to assign each concept,
object or stimulus being investigated a value on each scale.
Depending upon the problem being addressed, the basic procedure
has been modified in various respects. For example, Voiers
(1965) has used pairs of word clusters rather than single-word
pairs to define semantic continua, the choice of words comprising
each cluster being based on results of preliminary investigations
which, themselves, employed the semantic differential approach.
Such clusters were designed to reduce the subject's uncertainty
as to the nature of each perceptual continuum involved or as to
the meanings of individual terms.




Although it is theoretically possible to determine
the "semantic coordinates" of virtually any object or concept
by using scales defined with such general terms as "good-bad,"
"large-small," "beautiful-ugly," and so on, the use of terms
having more immediate relevance in a particular context (e.g.,
loud-soft, high-low, in the case of acoustical stimuli) can
be expected to increase the precision and economy of the method.
It is important, however, that technical jargon be avoided,
except where it can be assumed that the subjects involved are
fully acquainted with the meanings of the jargon expressions
or terms. A major purpose of the semantic differential approach
in a psychophysical context is, in fact, to develop a common
language by means of which individuals can communicate their
sensory-perceptual and affective experiences.

Regardless of the number of scales employed, subjects
in semantic differential experiments most often respond in ways
which indicate that a very limited number of orthogonal
parameters (typically three) can account for the systematic
component of their responses on the various scales. However, the
use of a greater number of scales is desirable to ensure a
comprehensive inventory of the subject's perceptual reactions to
the stimuli or concepts involved. Normally, then, the semantic
differential provides highly redundant characterizations of the
subject's response. Factor analysis or a related technique is
then employed to determine the number and nature of the
underlying or implicit parameters of the subject's response to the
stimuli or concepts involved.

A particularly useful property of the semantic
differential approach is that it permits the simultaneous
assessment of the affective or evaluative and the perceptual or
nonevaluative aspects of a subject's response to the stimulus
conditions involved. Thus, it can be used not only to identify
the perceptual correlates of various types and degrees of speech
signal degradation, but also to determine their interrelations
with each other and with the evaluative aspect of the subject's
response. It can be used, for example, not only to gauge the
acceptability of processed speech but also to provide insights
concerning the perceived characteristics which govern the
listener's evaluative reaction to such speech.

5.2  Development of the Quality Acceptability Rating
     Test (QUART)

For the development and validation of QUART it was
necessary, first, to obtain speech samples representing the
diverse forms of speech processing and degradation likely to
be encountered in communication situations of the present and
foreseeable future. Speech materials representing various
simple forms of degradation, plus materials that had been
processed by various digital voice communication systems, were
available for these purposes. These materials consisted of
ninety six-syllable, phonemically controlled sentences. Thirty
of these were spoken by each of three male speakers. They
were presented at an approximate rate of one sentence every
four seconds.

In the first of a succession of pilot studies a
semantic differential rating form involving 24 scales (see
Figure 5.1) was used by several samples of listeners to describe
their perceptions of the various types of speech processing and
degradation and to indicate the degree of acceptability they
would accord each type. Factor analysis of the results
indicated the existence of four orthogonal parameters of the
typical listener's response. It also provided some useful insights
concerning the interrelations among various perceived system
characteristics and system acceptability. Additionally, it
revealed:

1.  Several "silent" scales (i.e., scales for which
    listeners' responses provided little or no basis
    for discrimination among the system-conditions
    involved).

2.  Several highly redundant scales.

3.  Insufficient discrimination among some system
    conditions.

On the basis of these findings, a number of items were deleted
or modified, and new items introduced.

Over  the  course  of  five  additional  pilot  studies,  the 
number  of  semantic  rating  scales  was  reduced  to  twelve,  plus 
a 100-point  acceptability  rating  scale.  A rating  form  based 
on  these  scales  is  shown  in  Figure  5.2. 

5 . 3 Experimental  Validation  of  QUART 

5.3.1  Materials , Method  and  Procedure 

5. 3. 1.1  Experimental  Conditions  - To  validate  the  QUART 
concept,  generally,  and  System  Rating  Form  III,  in  particular, 
speech  samples  representing  20  system-conditions  and  6 forms  of 
laboratory  degradation  were  presented  to  35  listeners,  who  used 

a version  of  System  Rating  Form  III  to  indicate  their  perceptions 
and  evaluations  of  these  conditions.  The  conditions  (and  the 
abbreviations  used  in  subsequent  discussions)  were  as  follows : 

Laboratory Conditions

1.  (H)       Undegraded speech, lowpass filtered at 4 kHz.

2.  (L)       Speech processed sequentially by:
              a. A 2.4 kbps linear predictor with 1% bit error rate.
              b. An HY-2 channel vocoder.
              c. A 9.6 kbps CVSD with 5% bit error rate.
              d. A 4 kHz noisy channel which provided a processed
                 speech/noise ratio of 22 dB in the passband.

3.  (9 dB)    Unprocessed speech with additive filtered white
              noise, providing a speech/noise ratio of 9 dB,
              measured in a 4 kHz passband.

4.  (CLP)     Peak clipped speech.

5.  (Int.)    Interrupted speech with an interruption rate of
              150 ips and 50% duty cycle.

6.  (2 kHz)   Unprocessed speech lowpass filtered at 2 kHz.

System-Conditions

1.  (4.8L-0)  Linear predictor system at a 4.8 kbps (2.7 kbps speech
              data) transmission rate and 0% bit error rate
              (2.1 kbps used for error protection).

2.  (3.6L-0)  Linear predictor system at a 3.6 kbps (2.7 kbps speech
              data) transmission rate and 0% bit error rate
              (0.9 kbps used for error protection).

3.  (2.4L-0)  Linear predictor operating at 2.4 kbps.

4.  (A-0)     An adaptive predictive coder operating at 8.0 kbps
              (four coefficients plus quantized error signal
              and pitch period indication).

5.  (H-0)     HY-2 channel vocoder (2.4 kbps).

6.  (32C-0)   Continuously variable slope delta modulation
              system (CVSD) operating at 32 kbps.

7.  (16C-0)   CVSD operating at 16 kbps.

8.  (9.6C-0)  CVSD operating at 9.6 kbps.

9.  (P)       Parkhill (20 dB S/N).

10. (A-C)     Army vocoder in tandem with 16 kbps CVSD.

11. (C-A)     CVSD in tandem with Army vocoder.

12. (4.8L-5)  Linear predictor at 4.8 kbps (2.7 kbps) with 5%
              bit error rate (ber).

13. (3.6L-5)  Linear predictor at 3.6 kbps (2.7 kbps) with 5% ber.

14. (2.4L-5)  Linear predictor at 2.4 kbps with 5% ber.

15. (A-5)     An APC with 5% ber.

16. (H-5)     HY-2 vocoder with 5% ber.

17. (32C-5)   CVSD at 32 kbps with 5% ber.

18. (16C-5)   CVSD at 16 kbps with 5% ber.

19. (9.6C-5)  CVSD at 9.6 kbps with 5% ber.

20. (CMV)     CONUS Median Voice grade link.


[Fig. 5.1 Preliminary QUART Rating Form (System Rating Form IA)]


Dynastat

System:                Rater:                Date:

SYSTEM RATING FORM III-A

CONTINUOUS, SUSTAINED     ( ) ( ) ( ) ( ) ( ) ( ) ( )  INTERRUPTED, INTERMITTENT
THUMPING, THUDDING        ( ) ( ) ( ) ( ) ( ) ( ) ( )  CLICKING, TICKING
RATTLING, PATTERING       ( ) ( ) ( ) ( ) ( ) ( ) ( )  BUZZING, DRONING
CRACKLING, CLATTERING     ( ) ( ) ( ) ( ) ( ) ( ) ( )  SQUISHING, PLOPPING
NATURAL, HUMAN            ( ) ( ) ( ) ( ) ( ) ( ) ( )  UNNATURAL, MECHANICAL
SIMMERING, SEETHING       ( ) ( ) ( ) ( ) ( ) ( ) ( )  CHIRPING, CHEEPING
DIRTY, CLUTTERED          ( ) ( ) ( ) ( ) ( ) ( ) ( )  CLEAN, UNCLUTTERED
SHARP, PIERCING           ( ) ( ) ( ) ( ) ( ) ( ) ( )  DULL, MUFFLED
RUSHING, GUSHING          ( ) ( ) ( ) ( ) ( ) ( ) ( )  BABBLING, GURGLING
GUTTURAL, THICK           ( ) ( ) ( ) ( ) ( ) ( ) ( )  NASAL, THIN
UNINTELLIGIBLE, GARBLED   ( ) ( ) ( ) ( ) ( ) ( ) ( )  INTELLIGIBLE, DISTINCT
FLUTTERING, TWITTERING    ( ) ( ) ( ) ( ) ( ) ( ) ( )  SCRATCHING, SCRAPING

How would you rate this system on a 100 point scale of overall
acceptability?   ( )

(Assume that a typical telephone would receive a rating of 90)

Fig. 5.2 System Rating Form

5.3.1.2 Listeners - The listening crew was composed of males
and females between the ages of 18 and 29. All had survived a
screening and training regimen which involved pure tone audiometry,
the Diagnostic Rhyme Test, the Paired Acceptability
Rating Method, and QUART itself.

5.3.1.3 Speakers - Recordings by two male speakers, CH and LL,
provided the speech materials for this investigation. CH is a
relatively low-pitched speaker, LL a relatively high-pitched
speaker.

5.3.2 Experimental Design and Procedure - Test materials
spoken by the speakers were counterbalanced across listening
crews. Approximately half the listeners heard the materials
spoken by CH. Following a short break, they then heard the
materials spoken by speaker LL. This order was reversed for
the remaining listeners. In both cases, the laboratory processed
speech materials were presented first and in the same
order. Following the laboratory conditions, samples representing
the various system-conditions were presented in a randomly
determined order in the case of one speaker and in the reverse
order in the case of the other speaker.

A standard and an alternate version of the rating
form were used. With both versions the subject's final task was
to rate the system-condition involved on a 100-point scale of
acceptability. The versions differed only in that the order
and polarities of the rating scales were reversed in the case
of the alternate form.

5.3.2.1 Instructions to subjects - A standard set of instructions
(Appendix A) was read to each crew. Crew members were
then encouraged to ask questions as needed to clarify their
understanding of the task.

5.3.2.2 Familiarization with test materials - Prior to the
rating session proper, the subjects were allowed to hear a
sample sentence representing each of the 26 laboratory and
system-conditions. They were instructed not to rate these
samples but simply to attend to them as a means of experiencing
the range and diversity of speech qualities involved, and of
establishing a reference frame in terms of which to make their
ratings.

5.3.3 Analysis of Results - Since the interaction of
speakers and systems was negligible, data for the two speakers
were combined for purposes of the following analyses. No
further analysis of data for individual speakers was undertaken
for purposes of this investigation.

Each of the 12 semantic scales was assigned an arbitrary
polarity. Numbers from "one" to "seven" were then assigned
to the seven scale categories. Insofar as possible on an
a priori basis, polarities were determined such that higher
scale values were associated with favorable connotations, lower
scale values with unfavorable connotations. An example is
"intelligible-distinct," which clearly has a more favorable
connotation than "unintelligible-garbled." In some instances
where both characteristics have unfavorable connotations (for
example "chirping-cheeping" versus "simmering-seething") a
neutral rating of "four" is the most favorable rating. To
make fullest use of such bipolar scales, additional scoring
procedures were introduced. Specifically, data for Scales
3, 4, 6, 9, and 12 were evaluated first in a normal manner
and were then transformed to yield a second variable in each
case. This second or derived variable was based on absolute
deviations from the neutral rating of "four." Thus a total
of 18 variables (12 semantic scales, 5 derived variables, and
the acceptability rating) became available for purposes of
characterizing listeners' reactions to the various laboratory
and system conditions.
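
The scoring rule just described can be sketched in a few lines of Python. This is an illustrative reading of the text, not the scoring program used in the study: the function name score_form, the dictionary layout, and the variable labels S1..S12 and ACCPT are assumptions introduced here (the labels D3, D4, D6, D9, and D12 follow the usage later in this chapter).

```python
NEUTRAL = 4  # midpoint of the seven-category scale

# Scales that receive a second, derived score (per the text: 3, 4, 6, 9, 12).
DERIVED_SCALES = (3, 4, 6, 9, 12)


def score_form(ratings, acceptability):
    """Turn one completed rating form into the 18 analysis variables.

    ratings: dict mapping scale number (1-12) to a category value (1-7).
    acceptability: the 100-point overall acceptability rating.
    """
    # 12 primary variables: the scale values as marked.
    variables = {f"S{k}": float(ratings[k]) for k in range(1, 13)}
    # 5 derived variables: absolute deviation from the neutral category.
    for k in DERIVED_SCALES:
        variables[f"D{k}"] = float(abs(ratings[k] - NEUTRAL))
    # 1 acceptability variable.
    variables["ACCPT"] = float(acceptability)
    return variables


# Hypothetical form: neutral everywhere except Scales 6 and 9.
marks = {k: 4 for k in range(1, 13)}
marks[6] = 7
marks[9] = 2
example = score_form(marks, acceptability=65)
```

Note that a derived score of zero means "neutral" regardless of which pole the listener leaned toward, which is why the primary and derived versions of a scale carry different information.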

5.3.4 Results and Discussion - Table 5.1 presents the average
rating received by each of the 26 conditions on each of the
13 primary variables and the 5 derived variables. Word pairs
at the top and bottom of each column identify the upper and lower
extremes of each continuum. System differences with respect to
both primary and derived variables are evident, and various
trends can be detected on close scrutiny.

Means, standard deviations, and F-ratios for conditions
are presented for each variable in Table 5.2. Differences
among the variables in terms of discriminating power are evident.
Generally, those variables which involved evaluative reactions
discriminate most effectively among the 26 conditions. However,
all of the variables, both primary and derived, possess a high
degree of discriminating power, as attested to by F-ratios
which were significant at well beyond the .01 level in all
instances.
TABLE 5.2 Means, Standard Deviations and F-ratios for QUART Scales

                                                      F-ratio for
     SCALE                            MEAN    S.D.    System-Condition*

 1.         CONTN-SUSTN vs INTRP-INTRM   3.9    1.12      51.8
 2.         CLICK-TICK  vs THUMP-THUD    4.2     .36       7.3
 3.         CLATR-PATTR vs BUZZ-DRONE    3.9     .67      16.7
 4.         CRAKL-CLATR vs SQUISH-PLOP   4.3     .57      11.4
 5.         NATRL-HUMAN vs UNATR-MECHN   3.2    1.23     107.5
 6.         CHIRP-CHEEP vs SIMMR-SEETH   4.0     .91      41.1
 7.         CLEAR-UNCLU vs DIRTY-CLUTR   3.0    1.37     133.3
 8.         SHARP-PIERC vs DULL-MUFLD    3.5     .44       9.5
 9.         BABBL-GURGL vs RUSH-GUSH     4.3    1.20      71.3
10.         NASAL-THIN  vs GUTRL-THICK   3.9     .41       8.7
11.         INTLG-DISTC vs UNINT-GARBL   3.7    1.31     138.4
12.         FLUTR-TWITR vs SCRAT-SCRAP   3.8    1.33      87.4
13. (3D)**  BUZZ/CLATR   vs NUTRL         .5     .39      13.9
14. (4D)**  SQUISH/CRAKL vs NUTRL         .4     .48      20.9
15. (6D)**  SIMMR/CHIRP  vs NUTRL         .7     .60      23.5
16. (9D)**  RUSH/BABBL   vs NUTRL        1.0     .74      43.3
17. (12D)** SCRAT/FLUTR  vs NUTRL        1.0     .85      51.0
18.         ACCPT vs UNACP              50.7   17.29     230.8

 *F = M.S. Conditions / M.S. Conditions x Listeners.
  With 25 and 850 degrees of freedom, P < .01 for F > 1.18
**Derived variables
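
The F-ratio defined in the table footnote (mean square for conditions over mean square for the conditions x listeners interaction) can be sketched as follows. This is a minimal illustration that assumes one rating per condition per listener, i.e. that each listener's data for the two speakers have already been combined; the function name and the synthetic data are hypothetical. With 26 conditions and 35 listeners it reproduces the 25 and 850 degrees of freedom quoted above.

```python
import numpy as np


def conditions_f_ratio(x):
    """F = MS(conditions) / MS(conditions x listeners).

    x: array of shape (n_conditions, n_listeners), one rating per cell.
    Returns (F, df_conditions, df_interaction).
    """
    x = np.asarray(x, dtype=float)
    n_cond, n_list = x.shape
    grand = x.mean()
    cond_means = x.mean(axis=1, keepdims=True)
    list_means = x.mean(axis=0, keepdims=True)

    # Sum of squares for the conditions main effect.
    ss_cond = n_list * np.sum((cond_means - grand) ** 2)
    # Residual after removing both main effects = interaction term.
    resid = x - cond_means - list_means + grand
    ss_inter = np.sum(resid ** 2)

    df_cond = n_cond - 1
    df_inter = (n_cond - 1) * (n_list - 1)
    f = (ss_cond / df_cond) / (ss_inter / df_inter)
    return f, df_cond, df_inter


# Synthetic ratings only: 26 conditions, 35 listeners.
rng = np.random.default_rng(0)
true_levels = rng.uniform(2.0, 6.0, size=(26, 1))
ratings = true_levels + rng.normal(0.0, 1.0, size=(26, 35))
f_value, df1, df2 = conditions_f_ratio(ratings)
```
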
5.3.4.1 Dimensionality of Listener Response to System-
Conditions - By design, the semantic differential approach
provides a redundant characterization of the listener's perception
of the individual system-condition. This is evident
from Table 5.3, which shows the intercorrelations among the
18 primary and derived variables. Clearly, fewer than 18
dimensions are required to characterize listener response to
a system-condition. The nature and number of the underlying
dimensions of listener response thus become issues in need of
resolution. Factor analysis was used for this purpose.

The correlation matrix in Table 5.3 was subjected
to factor analysis by the principal components method. Five
orthogonal factors were found to account for the systematic or
reliable component of listener response to the 26 conditions.
Following rotation of axes to a Varimax criterion of simple
structure, further minor rotations were made in order to obtain
the psychologically most meaningful set of factors. The matrix
of factor loadings yielded by these procedures is shown in
Table 5.4.
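
The first two steps of this procedure (principal components extraction followed by Varimax rotation) can be sketched with NumPy as below. This is a generic textbook formulation run on synthetic data, not a reconstruction of the program used in the study; the function names are hypothetical, and the "further minor rotations" mentioned above were judgmental and are not modeled.

```python
import numpy as np


def principal_component_loadings(corr, n_factors):
    """Loadings of the first n_factors principal components of corr."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]
    # Scale each eigenvector by the square root of its eigenvalue.
    return eigvecs[:, order] * np.sqrt(eigvals[order])


def varimax(loadings, max_iter=200, tol=1e-10):
    """Orthogonal Varimax rotation; returns (rotated loadings, rotation)."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - rotated @ np.diag(
            (rotated ** 2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion <= criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


# Synthetic stand-in for Table 5.3: 18 variables observed on 26 conditions.
rng = np.random.default_rng(1)
scores = rng.normal(size=(26, 18))
corr = np.corrcoef(scores, rowvar=False)          # 18 x 18 correlations
unrotated = principal_component_loadings(corr, 5)  # five factors, as in text
rotated, rotation = varimax(unrotated)
```

Because the rotation is orthogonal, each variable's communality (the sum of its squared loadings across the five factors) is unchanged; only the distribution of that variance among the factors differs.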


The pattern of factor loadings in Table 5.4 provides
an adequate basis for identifying the five factors in psychological
or subjective terms. However, some additional insights
are to be gained from an examination of the configuration of
the system-conditions in the data space defined by the five
factors, i.e., a hyperspace whose primary axes are factors rather
than explicit variables. Table 5.5 contains the coordinates of
the 26 laboratory and system-conditions in the factorial data
space, where the origin and scale have been transformed such
that the means of all five distributions of factor scores fall
at 50 and the standard deviations reflect the reliabilities of
scores in each dimension. The effect of these transformations
is to preserve psychological distance relationships among
system-conditions with some degree of accuracy. Also shown
are the coordinates of a hypothetical subjectively neutral
system-condition for both the professional listener sample
and for the "target sample" (see Chapter 6). Projections
of these coordinates on selected planes of the factorial data
space are shown in Figures 5.3.1 - 5.3.4.

[TABLE 5.3 Intercorrelations of QUART Ratings of System-Conditions
by the Professional Listener Sample. Surviving table note
fragment: "... and neutral (4) ratings on all primary semantic
scales."]

Factor  I - Overall  Acceptability  - A factor  loading 
of  .98  in  the  case  of  the  acceptability  scale  coupled  with 
high  loadings  on  other  evaluative  scales  identifies  Factor  I 
as  the  affective  or  evaluative  component  of  the  listener's 
reactions  to  the  26  conditions.  Table  5.5  and  Figures  5.3.1  - 
5.3.4  show  the  various  system-conditions  to  be  ordered  in  a 
manner  which  is  consistent  with  this  interpretation. 

Further examination of the pattern of loadings on
Factor I provides some insights concerning the antecedents or
correlates of acceptability in the present instance. Particularly
noteworthy is the high loading of Scale 1. Evidently,
perceived temporal continuity of the speech signal was a major
consideration in the relative acceptabilities of the 26 conditions
involved here. Conditions that were perceived to
preserve the temporal continuity of the speech signal were
generally regarded with greater favor than those for which the
signal was perceived as interrupted or intermittent. Also
noteworthy is the high loading of the intelligibility scale on
this factor, indicating that perceived intelligibility is a major
condition of overall acceptability.

Listeners placed a high premium on naturalness,
cleanness, sharpness, and nasality (as opposed to gutturality).
High negative loadings on the derived variables D3, D4, D6, D9,
and D12 suggest that they looked on all forms of degradation
with some disfavor. Forced to choose, however, they favored
conditions involving noise-like degradation over conditions
involving various types of distortion. More specifically,
negative loadings in the cases of Scales 2, 3, 4, 6, 9, and 12
indicate that listeners preferred:


System-conditions                    System-conditions
characterized as:                    characterized as:

Thumping, Thudding                   Clicking, Ticking
Buzzing, Droning                     Rattling, Pattering
Squishing, Plopping         TO       Crackling, Clattering
Simmering, Seething                  Chirping, Cheeping
Rushing, Gushing                     Babbling, Gurgling
Scratching, Scraping                 Fluttering, Twittering

It must be stressed, however, that the relative preferences
indicated with respect to these qualities are undoubtedly determined
to a significant degree by the composition of the limited
sample of conditions available for this investigation. Extreme
caution should be exercised in extrapolating or generalizing
these results beyond the present sample of system-conditions.

Factor  II  - Babbling-Chirping  - This  factor  is  defined 
by  a number  of  scales,  all  of  which  would  appear  to  describe  a 
time-varying  form  of  degradation  as  opposed  to  a temporally 
continuous,  or  noise-like  form  of  degradation. 

Support for this interpretation is provided by the
configuration of data points in Figure 5.3.1. From the listener's
standpoint, it is this non-evaluative, perceptual quality that
most conspicuously distinguishes the delta modulation systems
from the narrowband analysis-synthesis systems.

[Fig. 5.3.1 Configuration of System-conditions in the
I x II Plane of the Factor Space. Axis label recovered:
Factor I -- Acceptability. Legend: point of subjective neutrality
for professional listener sample; + point of subjective
neutrality for target sample.]

Factor III - General Degradation - This factor is
defined entirely by derived rating items. To the extent that
a system-condition has a non-neutral status with respect to
such perceptual continua as chirping-simmering and fluttering-
scratching it is characterized by this factor. Figure 5.3.2
shows the configuration of system-conditions in this dimension
of the factor space. Conditions involving digital transmission
errors tend to rank highly on this dimension, but other forms of
degradation are also conducive to high rankings in this dimension.


Factor IV - Clicking-Clattering - This factor, in
combination with Factor III, effectively segregates system-conditions
in which bit errors occur (as shown in Figure 5.3.3),
though the two factors are defined by different rating scales.
The seemingly redundant functions of these two factors are
probably due to the fact that bit errors provide the predominant
form of degradation in the sample of system-conditions used in
this investigation. The low standing of the 9 dB S/N condition on
this factor suggests that it represents a noise versus distortion
opposition. However, further research involving more diverse
forms of degradation will be required to clarify this issue.

Factor V - Sharpness-Nasality - This factor is defined
by two scales which were conceived in an attempt to capture the
perceptual characteristics that distinguish vocoders from other
narrowband systems. The attempt was not successful, but the
factor evidently discriminates among systems on the basis of other
characteristics, as shown in Figure 5.3.4.

[Fig. 5.3.2 Configuration of System-conditions in the
I x III Plane of the Factorial Data Space]
It is evident that the precise nature and number of
the perceptual parameters of degraded speech have yet to be conclusively
defined. To do so will require further research
involving a greater diversity of system-conditions than was
available for this investigation. Examination of the factor
loadings of the QUART scales and the configuration of factor
scores for system-conditions strongly suggests that several
perceptual parameters tended to covary in this limited sample
of system-conditions, but are potentially
independently variable. More generally, the problem of identifying
factorial dimensions is complicated by the relatively
restricted sample of system-conditions used in this investigation;
the bulk of this sample falls within a relatively circumscribed
region of the perceptual space defined by the five
factors. In Figure 5.3 it may be seen that the centroid of
the configuration of systems in the factor space does not lie
at the point of subjective neutrality, i.e., the point representing
a hypothetical system-condition that would receive an
acceptability rating of 50 and neutral ratings on the twelve
primary semantic rating scales.

In view of the foregoing considerations, judgment as
to the exact nature and number of the elementary perceptual parameters
of speech quality must be reserved at this time. But
whatever the factorial structure of listeners' perceptions of
system-conditions, the rating data yielded by QUART have some
immediate practical value.

5.3.4.2 Predictive Validity of QUART - Individual rating
scales, both evaluative and non-evaluative, have substantial
potential for predicting system acceptance by the user population.
Evidence of this is provided by Table 5.6, which shows the correlations
between average semantic ratings of system-conditions
and average acceptability ratings by the target sample. Also
shown for comparative purposes are correlations between average
semantic ratings of system-conditions by the professional listener
sample and acceptability ratings by the target sample.

TABLE 5.6 Correlations Between Semantic Differential
Ratings and Target Sample Acceptability Ratings.

                                  Correlation with Target Sample
                                  Acceptability Rating

     Rating Scale                 Prof. List. Sample   Target Sample

 1.  Cont-Sustained                     .93                .98
 2.  Click-Tick                        -.36               -.25
 3.  Clatter-Patter                    -.33               -.35
 4.  Crackle-Clatter                   -.12                .11
 5.  Natural-Human                      .97                .97
 6.  Chirping-Cheeping                 -.39               -.70
 7.  Clean-Uncluttered                  .96                .95
 8.  Sharp-Piercing                     .61                .87
 9.  Babbling-Gurgling                 -.45               -.68
10.  Nasal-Thin                         .47                .78
11.  Intelligible-Distinct              .99                .99
12.  Fluttering-Twittering             -.18               -.49
13.  Clattering-Buzzing                -.47               -.37
14.  Crackling-Squishing               -.31               -.78
15.  Chirping-Simmering                -.63               -.86
16.  Babbling-Rushing                  -.73               -.75
17.  Fluttering-Scratching             -.67               -.35
18.  Acceptability                      .98                --

A correlation of .98 between acceptability ratings by the target
sample and acceptability ratings by the professional listener
sample implies that the two groups strongly agree on the relative
merits of the various system-conditions. This implication is
borne out by the pattern of correlations between acceptability
ratings by the target sample and semantic ratings by both groups.
The target sample's ratings of continuity, naturalness, clarity,
and intelligibility are highly correlated with its ratings of
acceptability. Corresponding semantic ratings by the professional
listener sample are only slightly less correlated with
the target sample's acceptability ratings. The latter results
provide a strong indication of the feasibility of predicting
user acceptance from QUART data yielded by laboratory listeners.
Further indications are provided by a comparison of samples from
these two populations in terms of how they perceive the differences
among representative system-conditions. To this end,
semantic differential rating data obtained from the target
sample were subjected to factor analysis. As in the case of the
professional listener sample, five interpretable factors were
obtained.
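
The correlations discussed above are computed across system-conditions, using each group's average rating per condition. A minimal sketch of that calculation follows; the function names and the small arrays are hypothetical illustrations, not data from the study.

```python
import numpy as np


def condition_means(ratings):
    """ratings: (n_listeners, n_conditions) -> mean rating per condition."""
    return np.asarray(ratings, dtype=float).mean(axis=0)


def pearson_r(a, b):
    """Product-moment correlation between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))


# Hypothetical acceptability ratings: rows are listeners, columns conditions.
professional = [[80, 55, 30, 62], [76, 50, 34, 58], [84, 60, 26, 66]]
target = [[88, 64, 41, 70], [90, 60, 45, 74]]

r_accept = pearson_r(condition_means(professional), condition_means(target))
```

Averaging over listeners before correlating removes most individual rater noise, which is one reason correlations of this kind can approach unity even when individual listeners disagree noticeably.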


The axes of the original factor space for the target
sample were rotated to maximize their congruence with the axes
of the factor space of the professional listening crew (Veldman,
1967). The resulting factor matrix is presented in Table 5.7.
Also shown, for purposes of comparison, is the matrix yielded by
the professional listening crew. Virtually perfect congruence
of the corresponding axes was achieved. Shown in Table 5.8 are
cosines between individual scale vectors (i.e., coefficients of
correlation between ratings by professional and target samples).
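
One common way to rotate one factor matrix toward another is the orthogonal Procrustes solution, sketched below. It is offered only as an illustration of the idea; the procedure in Veldman (1967) may differ in detail, and the function names and synthetic matrices here are assumptions.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Orthogonal T minimizing the Frobenius norm of source @ T - target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def cosines(a, b, axis):
    """Cosines between corresponding vectors of a and b.

    axis=0 compares columns (factor axes); axis=1 compares rows (scales).
    """
    num = (a * b).sum(axis=axis)
    den = np.sqrt((a ** 2).sum(axis=axis) * (b ** 2).sum(axis=axis))
    return num / den


# Synthetic example: the "target sample" loadings are the professional
# loadings expressed in an arbitrarily rotated set of axes.
rng = np.random.default_rng(2)
professional = rng.normal(size=(18, 5))
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # random orthogonal matrix
target_raw = professional @ q.T

t = procrustes_rotation(target_raw, professional)
target_rotated = target_raw @ t

axis_congruence = cosines(target_rotated, professional, axis=0)
scale_cosines = cosines(target_rotated, professional, axis=1)
```

In this contrived case the two structures differ only by a rotation, so every cosine comes out at 1. With real data, cosines below 1 indicate scales or axes on which the two samples genuinely differ.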


TABLE  5.7  Factorial  Structures  of  QUART  Ratings 
Professional  Listener  Sample  and  Target  Sample 


From the foregoing results it is clear that the two
samples discriminated systems with respect to essentially the
same perceptual parameters, although there are minor indications
that they value some perceptual qualities somewhat differently.
For example, the loadings of Scale 18 (acceptability) on Factors
II and III, though small, are somewhat higher for the target
sample than for the professional listener sample. The practical
and theoretical implications of these differences would appear to
be rather trivial, particularly when it is recalled that the
professional listener sample had undoubtedly had more extensive
exposure to modern digital voice communication systems than the
typical member of the target sample. Given a more broadly
experienced target sample, or a less experienced professional
sample, less pronounced differences might be expected. Further
examination revealed that the two samples also differed in terms
of their subjective neutral points, or adaptation levels, for the
various perceptual qualities, as is shown in Table 5.5 and
Figures 5.3.1 - 5.3.4. In general, the target sample tended to
be more lenient than the professional listener sample in its
ratings of the various conditions. The most likely explanation
of this discrepancy is that the target sample had a different conception
than the professional sample of what is implied by
"routine communications." Undoubtedly there were also individual
differences in this respect within both the professional and
target samples. Pre-exposure of listeners to a standard, simulated
communications situation might thus serve to significantly improve
the reliability of QUART results.

5.3.4.3 Practical Uses of QUART for the Prediction of User
Acceptance of Communication Systems - The results described
above support the hypothesis that professional listeners and
potential system users base their evaluative reactions to
communication systems on essentially the same perceptual
qualities and place similar values on each of these qualities.
In any case, there is a high correlation between professional
listeners' perceptions and users' affective or evaluative
reactions to processed speech. Several approaches to the
practical prediction of user acceptance thus merit consideration.


Extremely good prediction of user acceptance
reactions can be obtained using only the acceptability ratings
of a professional listener sample. The correlation between
these variables is shown graphically in Figure 5.4. However,
the high correlations between the perceptual reactions (via
semantic ratings) of professional listeners and acceptability
ratings by the target sample suggest that even better prediction
of user acceptance reactions can ultimately be obtained
by the use of multiple prediction techniques.

Unfortunately, the sample of system-conditions (20)
for which ratings by both the professional listener and target
samples are available is far too small to permit a valid test
of the feasibility of multiple prediction procedures (or in any
case, to yield a generally applicable set of regression coefficients).
Rating data from a sample of system users for a
large, representative sample of speech processing and communication
systems would be very desirable, but in the absence of
such data, a further step toward the validation of the multiple
prediction approach is possible. This step requires the assumption
that the professional listener population and the population
of system users do in fact value the various relevant perceptual
qualities of processed speech in essentially the same way, an
assumption which finds support from results described above. The results
of a study conducted after the formal termination of this project
are then of interest.


[Fig. 5.4 Correlation between acceptability ratings of
the target sample and professional listener sample. Axes:
Acceptability Rating - Target Sample; Acceptability Rating -
Professional Listeners.]
These results were yielded by QUARTs conducted on a
large sample of system-conditions using Dynastat's professional
listener sample only. A total of 182 conditions, including 3 bit
error rates for each of 37 system-conditions and six probes (each
of which was rated nine times), were rated by 17 professional
listeners, using System Rating Form III (Figure 5.2).

The multiple correlation between the average acceptability
rating of a condition and its ratings on the twelve
semantic scales was .99. The correlations between individual
semantic scales and rated acceptability are shown in Table 5.9,
which also shows the normalized regression coefficients (betas)
for each semantic scale. These results demonstrate the feasibility
of predicting acceptability from non-evaluative rating
data or of supplementing the results of acceptability ratings with
semantic rating data. They have a number of potentially significant
implications for the methodology of speech acceptability
evaluation.
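
The quantities reported here (simple correlations, normalized regression coefficients, and the multiple correlation) can be obtained from a least-squares fit on standardized variables. The sketch below uses synthetic data and hypothetical names; it illustrates the calculation only and does not reproduce the study's values.

```python
import numpy as np


def standardized_regression(x, y):
    """Regress y on the columns of x after z-scoring both.

    Returns (betas, simple_r, multiple_r):
      betas      - normalized regression coefficients, one per column
      simple_r   - correlation of each column with y
      multiple_r - correlation between fitted and observed y
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean(axis=0)) / x.std(axis=0)
    zy = (y - y.mean()) / y.std()
    betas, *_ = np.linalg.lstsq(zx, zy, rcond=None)
    simple_r = zx.T @ zy / len(zy)
    multiple_r = float(np.corrcoef(zx @ betas, zy)[0, 1])
    return betas, simple_r, multiple_r


# Synthetic stand-in: 182 conditions, mean ratings on 12 semantic scales.
rng = np.random.default_rng(3)
scale_means = rng.normal(4.0, 1.0, size=(182, 12))
weights = rng.normal(size=12)
acceptability = 50 + 10 * scale_means @ weights + rng.normal(0, 5, size=182)

betas, simple_r, multiple_r = standardized_regression(
    scale_means, acceptability)
```

A scale can show a sizable simple correlation with acceptability and still receive a beta near zero when its information is already carried by other scales, which is the pattern visible for several scales in Table 5.9.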

Although present evidence does not support the hypothesis
of qualitative differences between the value systems of
professional listeners and system users--the two samples discriminated
systems with respect to the same perceptual qualities and
valued these qualities similarly--the possibility remains that
other populations of system users will be found to apply a different
system of values in evaluating communication systems.
(None of the members of the present target sample held positions
at the command and staff level.) Given individuals with different
communications needs and purposes, one may expect to find different
criteria of acceptability employed. Isometric methods of acceptability
evaluation will fail in such circumstances, but parametric
methods, as exemplified above, can be adapted to them. There is
some basis, moreover, for predicting that the parametric approach
will prove less susceptible to the effects of attitudinal and
mood changes in the professional listener. It is not difficult
to imagine that a listener will tend to rate systems less
favorably when depressed, more favorably when elated; but it is more
difficult to conceive of how his mood would affect his judgments
of "continuous vs. interrupted," "natural vs. unnatural" or
"rushing vs. babbling."

TABLE 5.9 Correlations between Semantic Ratings and Acceptability
Ratings of 182 System-Conditions by the Professional
Listener Sample

                                         NORMALIZED
                         COEFFICIENTS    REGRESSION
(+) SCALE                OF CORRELATION  COEFFICIENTS  (-) SCALE

CONTINUOUS, SUSTAINED         .95            .18       INTERRUPTED, INTERMITTENT
CLICKING, TICKING            -.47           -.02       THUMPING, THUDDING
RATTLING, PATTERING          -.37            .03       BUZZING, DRONING
CRACKLING, CLATTERING         .14            .00       SQUISHING, PLOPPING
NATURAL, HUMAN                .95            .22       UNNATURAL, MECHANICAL
CHIRPING, CHEEPING           -.29           -.06       SIMMERING, SEETHING
CLEAN, UNCLUTTERED            .96            .12       DIRTY, CLUTTERED
SHARP, PIERCING               .40            .01       DULL, MUFFLED
BABBLING, GURGLING           -.46            .03       RUSHING, GUSHING
NASAL, THIN                   .31            .00       GUTTURAL, THICK
INTELLIGIBLE, DISTINCT        .99            .48       UNINTELLIGIBLE, GARBLED
FLUTTERING, TWITTERING       -.22            .04       SCRATCHING, SCRAPING

In summary, the validity of QUART, whether employed
isometrically, parametrically, or with a combination of the two
approaches, is attested to by a variety of evidence. What remains
to be accomplished is the implementation of standard procedures
for its use.

In  the  above  connection  it  would  be  highly  desirable 
to  have  normative  data  for  a more  diverse  sample  of  the 
types  of  degradation  imposed  on  the  speech  signal  by  modern  speech 
processing  and  communication  systems.  Although  a large  number  of 
conditions  have  been  treated  in  the  course  of  QUART  research  to 
date,  they  nevertheless  represent  a relatively  circumscribed  class. 
The  majority  of  these  were  narrow  band  digital  voice  systems 
involving a limited number of speech processing and coding
algorithms. Poorly represented in this sample were the various
forms of noise and distortion typical of analog communication
systems operating in various environments. Before QUART is
standardized, particularly with respect to the regression
coefficients used for parametric evaluation, and even with respect
to the semantic rating scales comprising the QUART rating form,
QUART data for such conditions must become available. In this
connection it should be emphasized again that the set of semantic
rating scales used in Systems Rating Form III was optimized for
discrimination within the particular sample of system-conditions
available  at  the  time.  A different  set  will  undoubtedly  be 
required  to  render  QUART  more  generally  applicable.  However,  the 
manner  in  which  this  issue  is  resolved  is  unlikely  to  affect  the 
validity  and  reliability  of  QUART  acceptability  ratings,  so  long 
as  the  listener  is  required  to  attend  closely  to  a variety  of 
perceptually  relevant  system  characteristics  before  making  an 
acceptability  rating. 
Secondly, it would be very desirable to obtain
normative QUART data from other segments of the population of
military communication system users, for example, from users
in command and staff positions. In the meantime, QUART, used
only in the isometric mode with properly selected probes and
anchors, can provide a highly reliable, valid and cost effective
means of practical system evaluation from the standpoint
of overall acceptability.


6.0  FURTHER VALIDATION OF PARM AND QUART


A factor which complicated the task of validating
PARM and QUART within the term, proper, of this project was the
unavailability of a sufficient amount of correlated PARM and
QUART data. Part of the problem was that acceptability ratings
by the target sample could be obtained only for a small and
questionably representative sub-sample of the total sample of
system-conditions ultimately evaluated with PARM. QUART data
for the remaining system-conditions were not available for
either the target sample or professional listener sample.
Fortunately, however, taped materials in QUART format for a sample
of 101 system-conditions were made available to Dynastat after
the formal completion of work on the project.

Dynastat undertook the performance of QUART evaluations
of these 101 conditions on its own volition. This made
available a set of correlated QUART and PARM data, subject to
identification by DCA of the systems for which PARM evaluations
had been conducted under Contract No. DCA100-75-C-0034.
Completion of these QUART evaluations, under Dynastat's auspices,
made it possible to test more fully the cross predictability of
PARM and QUART ratings. For this set of system-conditions the
coefficient of correlation was found to be .94. Figure 6.1
shows this correlation in graphic form. The correlation appears
to be somewhat lower than that previously obtained for a sample
of system-conditions with no bit errors and with 5% bit errors.
In this connection it should be recalled that all PARM data were
corrected for long term adaptation level drift on the basis of
an empirically derived algorithm. There is little question that
this algorithm was less than totally efficacious. But for
this complication a higher correlation would undoubtedly have
been obtained.
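
The coefficient of correlation quoted above is taken here to be the ordinary
product-moment (Pearson) coefficient; the report does not name the formula,
so that reading is an assumption. The sketch below shows how such a
coefficient could be computed for paired PARM and QUART ratings. The function
name and the sample numbers are hypothetical and are not data from the
project.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical mean ratings for five system-conditions.
    parm = [32.0, 45.5, 51.0, 60.2, 71.8]
    quart = [30.1, 47.0, 49.5, 63.0, 70.0]
    print(round(pearson_r(parm, quart), 3))
```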
Figure 6.1. Correlation


It is clear, in any event, that PARM and QUART
measure essentially the same aspects of listener reaction to
processed speech. With adequate control of listener factors,
both can provide highly reliable and valid indicants of the
acceptability of voice communications equipment.
APPENDIX

I.  PRODUCTION  OF  MASTER  TAPES 

In accordance with contract specifications, Dynastat
prepared master tape recordings of both DRT and acceptability
test materials.

Description  of  Speech  Materials 

The Diagnostic Rhyme Test (DRT) is a two-choice test
of consonant discriminability or, more accurately, a test of
the apprehensibility of the speaker's intent with respect to
the states of six elementary attributes of consonant phonemes
(Voiers, et al., 1973). It yields a gross indicant of speech
intelligibility and additional scores relating to specific
aspects of the performance of the speaker, listener or system
under test, and it utilizes a corpus of 192 words (96 rhyming
pairs). In a given instance, the listener's task is to indicate
which member of the pair has actually been spoken. A correct
choice indicates that the listener has, in effect, apprehended
the speaker's intent as to the state of one of six essentially
binary perceptual attributes of English consonant phonemes. An
incorrect choice indicates that the speaker, listener or system
under test has failed to distinguish the source state of
the attribute. Depending on the word pair involved, each item
tests for the apprehensibility of one of the following
elementary phonemic attributes:


Voicing 

Nasality 

Sustention 

Sibilation 

Graveness 

Compactness 
The DRT contains sixteen items, or word pairs, to test the
apprehensibility of each attribute, and the two states of each
attribute are given equal representation in the test. Table 1
shows the corpus of stimulus words used in the present version
(Form IV) of the Diagnostic Rhyme Test.
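
The structure described above (six attributes, sixteen word pairs each, a
two-choice response per item) lends itself to a simple tally per attribute.
The sketch below is illustrative only. The correction for guessing,
100 x (right - wrong) / total, is an assumption introduced for the example
and is not stated in this report; the function names and the response format
are likewise hypothetical.

```python
ATTRIBUTES = (
    "voicing", "nasality", "sustention",
    "sibilation", "graveness", "compactness",
)
PAIRS_PER_ATTRIBUTE = 16  # sixteen word pairs test each attribute


def adjusted_score(right, wrong):
    """Percent correct adjusted for chance on a two-choice item (assumed form)."""
    total = right + wrong
    if total == 0:
        raise ValueError("no responses to score")
    return 100.0 * (right - wrong) / total


def score_drt(responses):
    """Score an iterable of (attribute, is_correct) pairs.

    Returns (per_attribute_scores, overall_score).
    """
    tallies = {name: [0, 0] for name in ATTRIBUTES}  # [right, wrong]
    for attribute, is_correct in responses:
        tallies[attribute][0 if is_correct else 1] += 1
    per_attribute = {
        name: adjusted_score(right, wrong)
        for name, (right, wrong) in tallies.items()
        if right + wrong > 0
    }
    total_right = sum(right for right, _ in tallies.values())
    total_wrong = sum(wrong for _, wrong in tallies.values())
    return per_attribute, adjusted_score(total_right, total_wrong)


if __name__ == "__main__":
    # Hypothetical listener: every item correct except four voicing items.
    data = [(a, True) for a in ATTRIBUTES for _ in range(PAIRS_PER_ATTRIBUTE)]
    for i in range(4):
        data[i] = ("voicing", False)
    print(score_drt(data))
```

Under this assumed correction a listener responding at chance scores near
zero rather than near fifty, which keeps attribute scores comparable when
some attributes are far more degraded than others.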

The speech materials for acceptability test recordings
consisted of 900 six-syllable sentences, 600 declarative
sentences and 300 interrogative. Sentences were constructed
to meet the following criteria: at least one of the six syllables
contained a vowel from each of the categories shown in Table 2,
and each sentence contained at least one consonant from each of
the categories shown in Table 3.
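
The construction criteria above amount to a coverage check on each sentence.
The sketch below shows one way such a check could be written. It is not the
procedure used to build the 900 sentences. It assumes that each cell of
Table 2 and each column of Table 3 counts as one category, it represents
phonemes with ARPAbet-style ASCII codes chosen for this example from the key
words in those tables, and the function name and input format are
hypothetical.

```python
# Vowel categories after Table 2 (key words: team, tip / tool, took, tone /
# ton, bird / ten, tap / talk, top). Codes are ARPAbet-style assumptions.
VOWEL_CATEGORIES = {
    "high_front": {"iy", "ih"},
    "high_back": {"uw", "uh", "ow"},
    "mid": {"ah", "er"},
    "low_front": {"eh", "ae"},
    "low_back": {"ao", "aa"},
}

# Consonant categories after Table 3.
CONSONANT_CATEGORIES = {
    "sibilants": {"z", "s", "ch", "sh", "jh"},
    "stops": {"p", "t", "b", "d", "g", "k"},
    "fricatives": {"v", "f", "th", "dh"},
}


def meets_criteria(syllables):
    """Check a sentence given as a list of syllables (lists of phoneme codes)."""
    if len(syllables) != 6:
        return False
    phonemes = {p for syllable in syllables for p in syllable}
    categories = list(VOWEL_CATEGORIES.values()) + list(
        CONSONANT_CATEGORIES.values()
    )
    # Every category must be represented by at least one phoneme.
    return all(phonemes & members for members in categories)


if __name__ == "__main__":
    # A made-up phoneme sequence, not one of the recorded sentences.
    candidate = [
        ["s", "iy"], ["t", "uw"], ["f", "ah"],
        ["b", "eh"], ["d", "ao"], ["z", "ih"],
    ]
    print(meets_criteria(candidate))
```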

Recording  Master  Tapes 

The speaker was seated in a Tracoustics single wall
sound room 10' x 10' 8". Scotch 206 half-inch magnetic recording
tape was used with an Ampex 440B 4-track tape recorder, which
was located outside of the sound room.

Tapes were recorded at a speed of 1.5 ips, with peak
recording levels not exceeding a 0.5% harmonic distortion
threshold and an overall signal-to-noise ratio of at least 55 dB.
National Association of Broadcasters equalization standards were
observed for recording and playback.

Quiet  Environment  Recordings 

In the quiet environment two full list (384 words)
DRTs and a set of 90 acceptability sentences were recorded for
each speaker shown in Table 4. The microphones used and their
respective channels were as follows:


TABLE 1.  CORPUS OF STIMULUS ITEMS USED IN THE DRT (Form IV)


VOICING           NASALITY          SUSTENTION

DAUNT-TAUNT       MOOT-BOOT         SHEET-CHEAT
ZED-SAID          GNAW-DAW          SHOES-CHOOSE
DINT-TINT         NECK-DECK         THONG-TONG
VOLE-FOAL         NIP-DIP           FENCE-PENCE
BOND-POND         MOAN-BONE         VILL-BILL
VAST-FAST         KNOCK-DOCK        THOSE-DOZE
BEAN-PEEN         MAD-BAD           VOX-BOX
ZOO-SUE           NEED-DEED         THAN-DAN
VAULT-FAULT       NEWS-DUES         VEE-BEE
DENSE-TENSE       MOSS-BOSS         FOO-POOH
GIN-CHIN          MEND-BEND         SHAW-CHAW
GOAT-COAT         MITT-BIT          THEN-DEN
JOCK-CHOCK        NOTE-DOTE         THICK-TICK
GAFF-CALF         MOM-BOMB          THOUGH-DOUGH
VEAL-FEEL         NAB-DAB           VON-BON
DUNE-TUNE         MEAT-BEAT         SHAD-CHAD


SIBILATION        GRAVENESS         COMPACTNESS

JAB-GAB           POT-TOT           GHOST-BOAST
CHEEP-KEEP        BANK-DANK         GOT-DOT
CHEW-COO          WEED-REED         SHAG-SAG
SAW-THAW          POOL-TOOL         YIELD-WIELD
JEST-GUEST        FOUGHT-THOUGHT    COOP-POOP
SING-THING        MET-NET           CAUGHT-TAUGHT
JOE-GO            BID-DID           YEN-WREN
CHOP-COP          FORE-THOR         HIT-FIT
SANK-THANK        WAD-ROD           SHOW-SO
ZEE-THEE          FAD-THAD          HOP-FOP
JUICE-GOOSE       PEAK-TEAK         GAT-BAT
JAWS-GAUZE        MOON-NOON         KEY-TEA
CHAIR-CARE        BONG-DONG         YOU-RUE
JILT-GILT         PENT-TENT         YAWL-WALL
SOLE-THOLE        FIN-THIN          KEG-PEG
JOT-GOT           BOWL-DOLE         GILL-DILL


TABLE 2.  VOWEL CATEGORIES


          Front           Mid             Back

High      team - i                        tool - u
          tip  - ɪ                        took - ʊ
                                          tone - o

Mid                       ton  - ʌ
                          bird - ɜ

Low       ten  - ɛ                        talk - ɔ
          tap  - æ                        top  - ɑ

TABLE 3.  CONSONANT CATEGORIES


Sibilants         Stops           Fricatives

zip  - z          pat - p         vat  - v
sit  - s          top - t         for  - f
chat - tʃ         bat - b         thin - θ
shot - ʃ          dot - d         that - ð
jot  - dʒ         get - g
                  kit - k

TABLE 4.  FUNDAMENTAL FREQUENCY OF SPEAKERS


Low Pitch         CH - 102 Hz     BV - 103 Hz     MP - 200 Hz

Average Pitch     RH - 115 Hz     JE - 118 Hz     JS - 236 Hz

High Pitch        PK - 126 Hz     LL - 133 Hz     LS - 260 Hz
Channel   Microphones

1         Altec Dynamic, Model 659A, Serial # 1431

2         Western Electric, Model # T1

3         Grason Stadler Throat, Model # E7300M

4         General Radio Ceramic Studio, Model # 1560-P5,
          Serial # 2180

The  Altec  microphone  was  placed  approximately  two  inches 
to  the  right  of  the  speaker’s  lips;  the  Western  Electric  micro- 
phone to  the  left  of  the  lips  at  the  same  distance.  The  throat 
microphone  was  taped  to  the  speaker  just  below  the  frontal 
projection  of  the  larynx;  and  the  General  Radio  microphone 
was suspended 20 cm. from the front of the speaker's lips, in
grazing position. Figures 1 and 2 show the microphone placements
from two views.

Noise  Environment  Recordings 

Three male speakers (CH, JE, and RH) recorded one full
list DRT and 90 acceptability test sentences in each of the
following noise conditions:

1.  Air Borne Command Post (ABCP) - 85 dB*

2.  Helicopter - 115 dB

3.  Shipboard - 82 dB

4.  Office - 63 dB

One female speaker (JS) recorded one full list DRT and 90
sentences in the office noise condition only. A General Radio
Sound Level Meter, Model 1551C, was used for measuring the noise
level in each condition (C-weighted). Figure 3 shows block
diagrams of the equipment and the sound room used in the recording
of the noise environment conditions.




*SPL  (C-weighted) 
In the ABCP, shipboard, and office noise environments
the following microphones were used:

Channel   Microphones

1         Altec Dynamic, Model # 659A, Serial # 1431

2         Roanwell Noise Cancelling

3         Grason Stadler Throat, Model # E7300M

The microphone placements, shown in Figures 4 and 5, were the
same as in the quiet environment with the exception that the
Roanwell microphone was within one-half inch of the lips.
Rudmose headphones, RA-125 with TDH-39 elements, were used for
ear protection, as well as for carrying a feedback signal to
the speaker.

For the helicopter noise environment an Electrovoice
M-78/AIC Dynamic microphone replaced the Roanwell. Along with the
helicopter microphone, the Gentex helicopter helmet, Model SPH-4,
was used to protect the speaker's ears and provide a feedback
signal in the 115 dB environment. Microphone placement for the
helicopter noise condition is shown in Figure 6.

Editing  and  Quality  Control 

After  recording  the  full  list  DRTs  and  acceptability 
test  materials,  tapes  were  edited  and  assembled  for  evaluation 
by  the  listening  crew.  Full  test  DRTs  were  presented  to  the 
crew,  scored,  and  the  results  carefully  analyzed.  Tapes  were 
re-edited and evaluated again by the listening crew. Three-
speaker test modules were then assembled into their final
format.
Fig. 4  Microphone Placement in Noise Condition (Front View)

Fig. 5  Microphone Placement in Noise Condition (Back View)
Acceptability test materials were presented to a
listening crew to verify the correctness and quality of the
sentence recordings. Nine-speaker master sentence tapes were
then assembled.
II.  ANALOG COPIES

All copies of analog tape recordings required by the
contract were delivered. Tape recorders used in making the
recordings were two Ampex 440B 4-Track recorders, one TEAC
7030 GSL 2-Track recorder, and one Ampex 602.2 2-Track recorder.
Scotch 208 magnetic recording tape was used. Tables 4 and 5
provide a summary of analog tapes delivered.
III.  ANALOG TO SEVEN TRACK DIGITAL CONVERSION


As an intermediate step in producing nine-track
digital versions of the master tapes, a seven-track digital
tape was recorded. Seven-track tapes were recorded on
one-half-inch digital tape at 800 bytes per inch NRZI in ASCII
code and format. Digital sampling was at 12,000 Hz, with
each sample digitally represented in two's complement format
by at least 11 bits plus a sign bit. The speech signal
amplitude range was set at ± 5 volts peak. Figure 7 shows a
block diagram of the equipment used in the analog to digital
conversion. Table 6 provides a summary of seven-track tapes
delivered.
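
The sampling parameters above (12,000 Hz, 11 bits plus sign, a range of
5 volts peak in each direction) can be illustrated with a small quantizer.
This is a sketch under stated assumptions and not a description of the
DDP-116 A/D system: the linear mapping of full scale onto the code range,
the rounding rule, the clipping behavior and all function names are choices
made for the example.

```python
import math

SAMPLE_RATE_HZ = 12_000          # from the text
BITS = 12                        # 11 magnitude bits plus a sign bit
FULL_SCALE_VOLTS = 5.0           # peak amplitude range, either polarity
MAX_CODE = 2 ** (BITS - 1) - 1   # +2047
MIN_CODE = -(2 ** (BITS - 1))    # -2048


def quantize(volts):
    """Map a voltage to a signed 12-bit code, clipping at full scale."""
    code = math.floor(volts / FULL_SCALE_VOLTS * 2 ** (BITS - 1) + 0.5)
    return max(MIN_CODE, min(MAX_CODE, code))


def to_twos_complement(code):
    """Return the 12-bit two's complement bit pattern of a signed code."""
    if not MIN_CODE <= code <= MAX_CODE:
        raise ValueError("code out of 12-bit range")
    return code & 0xFFF


def from_twos_complement(word):
    """Recover the signed code from a 12-bit two's complement pattern."""
    word &= 0xFFF
    return word - 0x1000 if word & 0x800 else word


if __name__ == "__main__":
    # One millisecond of a hypothetical 1 kHz, 2.5 V peak test tone.
    tone = [
        2.5 * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ // 1000)
    ]
    print([format(to_twos_complement(quantize(v)), "03X") for v in tone])
```

With these assumptions one code step corresponds to 5/2048 volt, or about
2.4 millivolts.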
[Block diagram components: Ampex 440-B tape recorder, White lowpass
filter, HP 350D attenuator, DDP-116 A/D system, Teletype (TTY),
Control Data 7 track tape drive.]

Figure 7.  EQUIPMENT SET UP FOR ANALOG TO SEVEN TRACK CONVERSION.


TABLE 6 (1)  SEVEN TRACK DIGITAL TAPES


Tape        Speaker  Sex  List   Date      Mic.      Environment  Place

E1A1/E1A2   LL       M    302A   8/24/74   GR        Quiet        Dynastat
            CH       M    308B   8/29/74   "         "            "
E1A3/E1B1   RH       M    310A   9/04/74   "         "            "
            JE       M    306A   9/05/74   "         "            "
E1B2/E1B3   BV       M    303A   9/24/74   "         "            "
            PK       M    309A   9/23/74   "         "            "
E2A1/E2A2   LL       M    302B   8/24/74   "         "            "
            CH       M    307A   8/29/74   "         "            "
E2A3        RH       M    310B   9/04/74   "         "            "
E2B1        JE       M    306B   9/05/74   "         "            "
E2B2/E2B3   BV       M    303B   9/24/74   "         "            "
            PK       M    312B   9/23/74   "         "            "
E3A1/E3A2   LL       M    301A   8/25/74   "         "            "
            CH       M    308A   8/29/74   "         "            "
E3A3/E3B1   RH       M    311A   9/04/74   "         "            "
            JE       M    305A   8/28/74   "         "            "
E3B2/E3B3   BV       M    304A   9/24/74   "         "            "
            PK       M    312A   9/23/74   "         "            "
E4A1/E4A2   LL       M    301B   8/25/74   "         "            "
            CH       M    307B   8/29/74   "         "            "
E4A3/E4B1   RH       M    311B   9/04/74   "         "            "
            JE       M    305B   8/24/74   "         "            "
E4B2/E4B3   BV       M    304B   9/24/74   "         "            "
            PK       M    309B   9/23/74   "         "            "
E5A1        JS       F    317A   8/30/74   "         "            "
E5A2        LS       F    315A   9/20/74   "         "            "
E5A3/E5B1   MP       F    314A   9/21/74   "         "            "
            JS       F    317B   8/30/74   "         "            "
E5B2        LS       F    315B   9/20/74   "         "            "
E5B3        MP       F    314B   9/21/74   "         "            "
TABLE 6 (2)  SEVEN TRACK DIGITAL TAPES


Tape        Speaker  Sex  List   Date      Mic.      Environment  Place

E6A1        JS       F    318A   8/30/74   GR        Quiet        Dynastat
E6A2        LS       F    316A   9/05/74   "         "            "
E6A3/E6B1   MP       F    313A   9/21/74   "         "            "
            JS       F    318B   8/30/74   "         "            "
E6B2/E6B3   LS       F    316B   9/05/74   "         "            "
            MP       F    313B   9/21/74   "         "            "

TABLE 6 (3)  SEVEN TRACK DIGITAL TAPES


Tape        Speaker  Sex  List   Date      Mic.      Environment  Place

E1A1/E1A2   LL       M    302A   8/24/74   Carbon    Quiet        Dynastat
            CH       M    308B   8/29/74   "         "            "
E1A3/E1B1   RH       M    310A   9/04/74   "         "            "
            JE       M    306A   9/05/74   "         "            "
E1B2/E1B3   BV       M    303A   9/24/74   "         "            "
            PK       M    309A   9/23/74   "         "            "
E2A1/E2A2   LL       M    302B   8/24/74   "         "            "
            CH       M    307A   8/29/74   "         "            "
E2A3/E2B1   RH       M    310B   9/04/74   "         "            "
            JE       M    306B   9/05/74   "         "            "
E2B2/E2B3   BV       M    303B   9/24/74   "         "            "
            PK       M    312B   9/23/74   "         "            "
E3A1/E3A2   LL       M    301A   8/25/74   "         "            "
            CH       M    308A   8/29/74   "         "            "
E3A3/E3B1   RH       M    311A   9/04/74   "         "            "
            JE       M    305A   8/28/74   "         "            "
E3B2/E3B3   BV       M    304A   9/24/74   "         "            "
            PK       M    312A   9/23/74   "         "            "
E4A1/E4A2   LL       M    301B   8/25/74   "         "            "
            CH       M    307B   8/29/74   "         "            "
E4A3/E4B1   RH       M    311B   9/04/74   "         "            "
            JE       M    305B   8/24/74   "         "            "
E4B2/E4B3   BV       M    304B   9/24/74   "         "            "
            PK       M    309B   9/23/74   "         "            "
E5A1        JS       F    317A   8/30/74   "         "            "
E5A2        LS       F    315A   9/20/74   "         "            "
E5A3/E5B1   MP       F    314A   9/21/74   "         "            "
            JS       F    317B   8/30/74   "         "            "
E5B2        LS       F    315B   9/20/74   "         "            "
E5B3        MP       F    314B   9/21/74   "         "            "
E6A1        JS       F    318A   8/30/74   "         "            "
E6A2        LS       F    316A   9/05/74   "         "            "
1 1 
TABLE 6 (4)  SEVEN TRACK DIGITAL TAPES


Tape        Speaker  Sex  List   Date      Mic.      Environment  Place

E6A3/E6B1   MP       F    313A   9/21/74   Carbon    Quiet        Dynastat
            JS       F    318B   8/30/74   "         "            "
E6B2/E6B3   LS       F    316B   9/05/74   "         "            "
            MP       F    313B   9/21/74   "         "            "
G1A1/G1A2   RH       M    318A   9/07/74   Altec     ABCP         "
            JE       M    310A   9/14/74   "         "            "
G1A3/G1B1   CH       M    314A   9/07/74   "         "            "
            RH       M    318B   9/07/74   "         "            "
G1B2/G1B3   JE       M    310B   9/14/74   "         "            "
            CH       M    314B   9/07/74   "         "            "
G3A1        RH       M    303A   9/11/74   "         Shipboard    "
G3A2        JE       M    311A   9/15/74   "         "            "
G3A3        CH       M    315A   9/12/74   "         "            "
G3B1        RH       M    303B   9/11/74   "         "            "
G3B2        JE       M    311B   9/15/74   "         "            "
G3B3        CH       M    315B   9/12/74   "         "            "
G4A1        RH       M    304A   9/15/74   Roanwell  Office       "
G4A2        JE       M    312A   9/15/74   "         "            "
G4A3/G4A4   CH       M    316A   9/15/74   "         "            "
            JS       F    305A   9/16/74   "         "            "
G4B1        RH       M    304B   9/15/74   "         "            "
G4B2        JE       M    312B   9/15/74   "         "            "
G4B3/G4B4   CH       M    316B   9/15/74   "         "            "
            JS       F    305B   9/16/74   "         "            "
IV.  CONVERSION OF SEVEN-TRACK
DIGITAL TAPES TO NINE-TRACK FORMAT

Seven-track digital tapes were converted to nine-track
digital format via a Dynastat-written FORTRAN program
on a Data General NOVA 2/10 computer system. Sixteen bit
data words were constructed to include a twelve bit sample
plus four sync bits as specified in the Statement of Work.
Records were 1000 words each (2000 bytes). Nine-track tapes
were written in even parity at 800 bytes per inch. Each tape
file is prefaced by a header record which specifies various
analog recording data including: type of analog material
(i.e., DRT scrambling, acceptability test sentence, tape
announcement, speaker announcement, or calibration tone),
microphone information, speaker identification, recording
dates and other data as outlined in subject Statement of Work.
A summary of the nine-track digital tapes delivered by Dynastat
is shown in Table 7.
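
The word and record layout described above can be sketched in a few lines.
The report states only that each sixteen bit word carries a twelve bit sample
plus four sync bits and that a record holds 1000 words (2000 bytes). The
position of the sync bits (upper four bits here), their value (binary 1010
here) and the byte order (most significant byte first here) are hypothetical
choices for this example; the actual layout is defined in the Statement of
Work, which is not reproduced in this report.

```python
import struct

WORDS_PER_RECORD = 1000     # from the text; 2 bytes per word gives 2000 bytes
SYNC_PATTERN = 0b1010       # hypothetical sync value, for illustration only


def pack_word(sample, sync=SYNC_PATTERN):
    """Combine a signed 12-bit sample and 4 sync bits into a 16-bit word."""
    if not -2048 <= sample <= 2047:
        raise ValueError("sample out of 12-bit range")
    if not 0 <= sync <= 0xF:
        raise ValueError("sync must fit in 4 bits")
    # Assumed layout: sync in bits 15-12, two's complement sample in bits 11-0.
    return (sync << 12) | (sample & 0xFFF)


def unpack_word(word):
    """Split a 16-bit word into (sync, signed sample)."""
    sync = (word >> 12) & 0xF
    raw = word & 0xFFF
    sample = raw - 0x1000 if raw & 0x800 else raw
    return sync, sample


def build_record(samples):
    """Pack exactly 1000 samples into one 2000-byte record."""
    if len(samples) != WORDS_PER_RECORD:
        raise ValueError("a record holds exactly 1000 words")
    words = [pack_word(s) for s in samples]
    return struct.pack(">%dH" % WORDS_PER_RECORD, *words)


if __name__ == "__main__":
    record = build_record([0] * WORDS_PER_RECORD)
    print(len(record), "bytes per record")
```

At the 12,000 Hz sampling rate given in Section III, one such record would
hold 1000/12000 second, or about 83 milliseconds, of speech.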


TABLE 7 (1)  NINE TRACK DIGITAL TAPES


Tape        Speaker  Sex  List   Date      Mic.      Environment  Place

E1A1/E1A2   LL       M    302A   8/24/74   Altec     Quiet        Dynastat
            CH       M    308B   8/29/74   "         "            "
E1A3/E1B1   RH       M    310A   9/04/74   "         "            "
            JE       M    306A   9/05/74   "         "            "
E1B2/E1B3   BV       M    303A   9/24/74   "         "            "
            PK       M    309A   9/23/74   "         "            "
E2A1/E2A2   LL       M    302B   8/24/74   "         "            "
            CH       M    307A   8/29/74   "         "            "
E2A3/E2B1   RH       M    310B   9/04/74   "         "            "
            JE       M    306B   9/05/74   "         "            "
E2B2/E2B3   BV       M    303B   9/24/74   "         "            "
            PK       M    312B   9/23/74   "         "            "
E3A1/E3A2   LL       M    301A   8/25/74   "         "            "
            CH       M    308A   9/29/74   "         "            "
E3A3/E3B1   RH       M    311A   9/04/74   "         "            "
            JE       M    305A   8/28/74   "         "            "
E3B2/E3B3   BV       M    304A   9/24/74   "         "            "
            PK       M    312A   9/23/74   "         "            "
E4A1/E4A2   LL       M    301B   8/25/74   "         "            "
            CH       M    307B   8/29/74   "         "            "
E4A3/E4B1   RH       M    311B   9/04/74   "         "            "
            JE       M    305B   8/24/74   "         "            "
E4B2/E4B3   BV       M    304B   9/24/74   "         "            "
            PK       M    309B   9/23/74   "         "            "
E5A1        JS       F    317A   8/30/74   "         "            "
E5A2        LS       F    315A   9/20/74   "         "            "
E5A3        MP       F    314A   9/21/74   "         "            "
E5B1        JS       F    317B   8/30/74   "         "            "
E5B2        LS       F    315B   9/20/74   "         "            "
E5B3        MP       F    314B   9/21/74   "         "            "
E6A1        JS       F    318A   8/30/74   "         "            "
E6A2        LS       F    316A   9/05/74   "         "            "
E6A3        MP       F    313A   9/21/74   "         "            "
E6B1        JS       F    318B   8/30/74   "         "            "
E6B2        LS       F    316B   9/05/74   "         "            "
E6B3        MP       F    313B   9/21/74   "         "            "
TABLE 7 (2)  NINE TRACK DIGITAL TAPES


Tape        Speaker  Sex  List   Date      Mic.        Environment  Place

G1A1        RH       M    318A   9/07/74   Roanwell    ABCP         Dynastat
G1A2        JE       M    310A   9/14/74   "           "            "
G1A3        CH       M    314A   9/07/74   "           "            "
G1B1        RH       M    318B   9/07/74   "           "            "
G1B2/G1B3   JE       M    310B   9/14/74   "           "            "
            CH       M    314B   9/07/74   "           "            "
G2A1        RH       M    317A   9/11/74   Helicopter  Helicopter   "
G2A2        JE       M    309A   9/14/74   "           "            "
G2A3        CH       M    313B   9/12/74   "           "            "
G2B1        RH       M    317B   9/11/74   "           "            "
G2B2        JE       M    309B   9/14/74   "           "            "
G2B3        CH       M    313A   9/12/74   "           "            "
G3A1        RH       M    303A   9/11/74   Roanwell    Shipboard    "
G3A2        JE       M    311A   9/15/74   "           "            "
G3A3        CH       M    315A   9/12/74   "           "            "
G3B1        RH       M    303B   9/11/74   "           "            "
G3B2        JE       M    311B   9/15/74   "           "            "
G3B3        CH       M    315B   9/12/74   "           "            "
G4A1        RH       M    304A   9/15/74   Altec       Office       "
G4A2        JE       M    312A   9/15/74   "           "            "
G4A3        CH       M    316A   9/15/74   "           "            "
G4A4        JS       F    305A   9/16/74   "           "            "
G4B1        RH       M    304B   9/15/74   "           "            "
G4B2        JE       M    312B   9/15/74   "           "            "
G4B3        CH       M    316B   9/15/74   "           "            "
G4B4        JS       F    305B   9/16/74   "           "            "